Abstract: In this paper, we present a low-complexity algorithm for detection in high-rate, non-orthogonal space-time block coded (STBC) large-MIMO systems that achieve high spectral efficiencies of the order of tens of bps/Hz. We also present a training-based iterative detection/channel estimation scheme for such large STBC MIMO systems. Our simulation results show that excellent bit error rate and nearness-to-capacity performance are achieved by the proposed multistage likelihood ascent search (M-LAS) detector in conjunction with the proposed iterative detection/channel estimation scheme at low complexities. The fact that we could show such good results for large STBCs like 16 × 16 and 32 × 32 STBCs from Cyclic Division Algebras (CDA) operating at spectral efficiencies in excess of 20 bps/Hz (even after accounting for the overheads meant for pilot based training for channel estimation and turbo coding) establishes the effectiveness of the proposed detector and channel estimator. We decode perfect codes of large dimensions using the proposed detector. With the feasibility of such a low-complexity detection/channel estimation scheme, large-MIMO systems with tens of antennas operating at several tens of bps/Hz spectral efficiencies can become practical, enabling interesting high data rate wireless applications.

Index Terms: Large-MIMO systems, low-complexity detection, channel estimation, non-orthogonal space-time block codes, high spectral efficiencies.



I. Introduction 

Current wireless standards (e.g., IEEE 802.11n and 802.16e) have adopted MIMO techniques [1]-[3] to achieve the benefits of transmit diversity (using space-time coding) and high data rates (using spatial multiplexing). They, however, harness only a limited potential of MIMO benefits since they use only a small number of transmit antennas (e.g., 2 to 4 antennas). Significant benefits can be realized if a large number of antennas is used; e.g., large-MIMO systems with tens of antennas in communication terminals can enable multi-gigabit rate transmissions at high spectral efficiencies of the order of several tens of bps/Hz.¹ Key challenges in realizing such large-MIMO systems include low-complexity detection and channel estimation, RF/IF technologies, and placement of a large number of antennas in communication terminals.² Our focus in this paper is on low-complexity detection and channel estimation for large-MIMO systems.

¹ Spectral efficiencies achieved in current MIMO wireless standards are only about 10 bps/Hz or less.

Spatial multiplexing (V-BLAST) with large number of transmit antennas can offer high spectral efficiencies, but it does not give transmit diversity. On the other hand, well known orthogonal space-time block codes (STBC) have the advantages of
full transmit diversity and low decoding complexity, but they 
suffer from rate loss for increasing number of transmit antennas [2], [5], [6]. However, full-rate, non-orthogonal STBCs
from Cyclic Division Algebras (CDA) [7] are attractive to
achieve high spectral efficiencies in addition to achieving full 
transmit diversity, using large number of transmit antennas. 
For example, a 32 x 32 STBC matrix from CDA has 1024 
symbols (i.e., 32 complex symbols per channel use), and using 
this STBC along with 16-QAM and rate-3/4 turbo code offers 
a spectral efficiency of 96 bps/Hz. While maximum-likelihood 
(ML) decoding of orthogonal STBCs can be achieved in 
linear complexity, ML or near-ML decoding of non-orthogonal 
STBCs with large number of antennas at low complexities 
has been a challenge. Channel estimation is also a key issue 
in large-MIMO systems. In this paper, we address these two 
challenging problems; our proposed solutions can potentially 
enable realization of large-MIMO systems in practice. 

Sphere decoding and several of its low-complexity variants are known in the literature [8]-[11]. These detectors, however, are prohibitively complex for large number of antennas. Recent approaches to low-complexity multiuser/MIMO detection involve application of techniques from belief propagation [12], Markov Chain Monte-Carlo methods [13], neural networks [14], [15], [16], etc. In particular, in [15], [16], we presented a powerful Hopfield neural network based low-complexity search algorithm for detecting large-MIMO V-BLAST signals, and showed that it performs quite close to (within 4.6 dB of) the theoretical capacity, at high spectral efficiencies of the order of tens to hundreds of bps/Hz using tens to hundreds of antennas, at an average per-symbol detection complexity of just $O(N_t N_r)$, where $N_t$ and $N_r$ denote the number of transmit and receive antennas, respectively.

² WiFi products in 2.5 GHz band which use 12 transmit antennas for beamforming purposes are becoming commercially available [4]. With such RF and antenna technologies for placing large number of antennas in medium/large aperture communication terminals (like set-top boxes/laptops) getting increasingly matured, low-complexity high-performance MIMO baseband receiver techniques (e.g., detection and channel estimation) are crucial to enable practical implementations of high spectral efficiency large-MIMO systems, which, in turn, can enable high data rate applications like wireless IPTV/HDTV distribution.

In this paper, we present i) a low-complexity near-ML achieving detector, and ii) an iterative detection/channel estimation scheme for large non-orthogonal STBC MIMO systems having tens of transmit and receive antennas. Our key contributions here can be summarized as follows:

1) We generalize the 1-symbol update based likelihood ascent search (LAS) algorithm we proposed in [15], [16], by employing a low-complexity multistage multi-symbol update based strategy; we refer to this new algorithm as multistage LAS (M-LAS) algorithm. We show that the M-LAS algorithm outperforms the basic LAS algorithm with some increase in complexity.

2) We propose a method to generate soft outputs from the M-LAS output vector. Soft outputs generation was not considered in [15], [16]. The proposed soft outputs generation for the individual bits results in about 1 to 1.5 dB improvement in coded bit error rate (BER) compared to hard decision M-LAS outputs.

3) Assuming i.i.d. fading and perfect channel state information at the receiver (CSIR), our simulation results show that the proposed M-LAS algorithm is able to decode large non-orthogonal STBCs (e.g., 16 × 16 and 32 × 32 STBCs) and achieve near single-input single-output (SISO) AWGN uncoded BER performance as well as near-capacity (within 4 dB from theoretical capacity) coded BER performance.

4) Using the proposed detector, we decode and report the simulated BER performance of 'perfect codes' [17]-[21] of large dimensions.

5) Presenting a BER performance and complexity comparison of the proposed CDA STBC/M-LAS detection approach with other large-MIMO/detector approaches (e.g., stacked Alamouti codes/QOSTBCs and associated interference canceling receivers reported in [22]), we show that the proposed approach outperforms the other considered approaches, both in terms of performance as well as complexity.

6) We present simulation results that quantify the loss in BER performance due to spatial correlation in large-MIMO systems, by considering a more realistic spatially correlated MIMO fading channel model proposed by Gesbert et al. in [23]. We show that this loss in performance can be alleviated by providing more receive dimensions (i.e., more receive antennas than transmit antennas).

7) Finally, we present a training-based iterative detection/channel estimation scheme for large STBC MIMO systems. We report BER and nearness-to-capacity results when the channel matrix is estimated using the proposed iterative scheme and compare these results with those obtained using perfect CSIR assumption.

The rest of the paper is organized as follows. In Section II, we present the STBC MIMO system model considered. The proposed detection algorithm is presented in Section III. BER performance results with perfect CSIR are presented in Section IV. This section includes the results on the effect of spatial correlation, BER performance of large perfect codes, and comparison of the proposed scheme with other large-MIMO architecture/detector combinations. The proposed iterative detection/channel estimation scheme and the corresponding performance results are presented in Section V. Conclusions are presented in Section VI.

II. System Model 

Consider a STBC MIMO system with multiple transmit and multiple receive antennas. An $(n, p, k)$ STBC is represented by a matrix $X_c \in \mathbb{C}^{n \times p}$, where $n$ and $p$ denote the number of transmit antennas and number of time slots, respectively, and $k$ denotes the number of complex data symbols sent in one STBC matrix. The $(i,j)$th entry in $X_c$ represents the complex number transmitted from the $i$th transmit antenna in the $j$th time slot. The rate of an STBC, $r$, is given by $r = k/p$. $N_r$ and $N_t = n$ denote the number of receive and transmit antennas, respectively. Let $H_c \in \mathbb{C}^{N_r \times N_t}$ denote the channel gain matrix, where the $(i,j)$th entry in $H_c$ is the complex channel gain from the $j$th transmit antenna to the $i$th receive antenna. We assume that the channel gains remain constant over one STBC matrix duration. Assuming rich scattering, we model the entries of $H_c$ as i.i.d. $\mathcal{CN}(0,1)$.³ The received space-time signal matrix, $Y_c \in \mathbb{C}^{N_r \times p}$, can be written as

$$Y_c = H_c X_c + N_c, \tag{1}$$

where $N_c \in \mathbb{C}^{N_r \times p}$ is the noise matrix at the receiver and its entries are modeled as i.i.d. $\mathcal{CN}\left(0, \sigma^2 = \frac{N_t E_s}{\gamma}\right)$, where $E_s$ is the average energy of the transmitted symbols, and $\gamma$ is the average received SNR per receive antenna [3], and the $(i,j)$th entry in $Y_c$ is the received signal at the $i$th receive antenna in the $j$th time slot. In a linear dispersion (LD) STBC, $X_c$ can be decomposed into a linear combination of weight matrices corresponding to each data symbol and its conjugate as

$$X_c = \sum_{i=1}^{k} x_c^{(i)} A_c^{(i)} + \left(x_c^{(i)}\right)^{*} E_c^{(i)}, \tag{2}$$

where $x_c^{(i)}$ is the $i$th complex data symbol, and $A_c^{(i)}, E_c^{(i)} \in \mathbb{C}^{N_t \times p}$ are its corresponding weight matrices. The detection algorithm we propose in this paper can decode general LD STBCs of the form in (2). For the purpose of simplicity in exposition, here we consider a subclass of LD STBCs, where $X_c$ can be written in the form

$$X_c = \sum_{i=1}^{k} x_c^{(i)} A_c^{(i)}. \tag{3}$$

From (1) and (3), applying the $vec(\cdot)$ operation⁴, we have

$$vec(Y_c) = \sum_{i=1}^{k} x_c^{(i)} \, vec\left(H_c A_c^{(i)}\right) + vec(N_c). \tag{4}$$

³ $\mathcal{CN}(0, \sigma^2)$ denotes a circularly symmetric complex Gaussian distribution with mean zero and variance $\sigma^2$.

⁴ For a $p \times q$ matrix $M = [m_1 \, m_2 \cdots m_q]$, where $m_i$ is the $i$th column of $M$, $vec(M)$ is a $pq \times 1$ vector defined as $vec(M) = [m_1^T \, m_2^T \cdots m_q^T]^T$, where $[\cdot]^T$ denotes the transpose operation.
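If $U, V, W, D$ are matrices such that $D = UWV$, then it is true that $vec(D) = (V^T \otimes U)\, vec(W)$, where $\otimes$ denotes tensor product of matrices [24]. Using this, we can write (4) as

$$vec(Y_c) = \sum_{i=1}^{k} x_c^{(i)} (I \otimes H_c)\, vec\left(A_c^{(i)}\right) + vec(N_c), \tag{5}$$

where $I$ is the $p \times p$ identity matrix. Further, define $y_c \triangleq vec(Y_c)$, $\hat{H}_c \triangleq (I \otimes H_c)$, $a_c^{(i)} \triangleq vec(A_c^{(i)})$, and $n_c \triangleq vec(N_c)$. From these definitions, it is clear that $y_c \in \mathbb{C}^{N_r p \times 1}$, $\hat{H}_c \in \mathbb{C}^{N_r p \times N_t p}$, $a_c^{(i)} \in \mathbb{C}^{N_t p \times 1}$, and $n_c \in \mathbb{C}^{N_r p \times 1}$. Let us also define a matrix $\tilde{H}_c \in \mathbb{C}^{N_r p \times k}$, whose $i$th column is $\hat{H}_c a_c^{(i)}$, $i = 1, \cdots, k$. Let $x_c \in \mathbb{C}^{k \times 1}$, whose $i$th entry is the data symbol $x_c^{(i)}$. With these definitions, we can write (5) as

$$y_c = \sum_{i=1}^{k} x_c^{(i)} \left(\hat{H}_c a_c^{(i)}\right) + n_c = \tilde{H}_c x_c + n_c. \tag{6}$$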
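The vectorization step leading to (6) can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the helper names `vec` and `equivalent_channel` and the toy weight matrices (one symbol per codeword entry) are assumptions made only for illustration.

```python
import numpy as np

def vec(M):
    # Stack the columns of M into one vector (column-major order).
    return M.reshape(-1, order="F")

def equivalent_channel(Hc, weights):
    # Column i is (I kron Hc) vec(A_i), which equals vec(Hc A_i).
    p = weights[0].shape[1]
    H_hat = np.kron(np.eye(p), Hc)
    return np.column_stack([H_hat @ vec(A) for A in weights])

rng = np.random.default_rng(0)
Nt, Nr, p = 2, 3, 2

# Hypothetical weight matrices: symbol number (col*Nt + row) sits at entry (row, col).
weights = []
for col in range(p):
    for row in range(Nt):
        A = np.zeros((Nt, p), dtype=complex)
        A[row, col] = 1.0
        weights.append(A)
k = len(weights)

Hc = (rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))) / np.sqrt(2)
xc = rng.choice([-1.0, 1.0], size=k) + 1j * rng.choice([-1.0, 1.0], size=k)

Xc = sum(x * A for x, A in zip(xc, weights))   # codeword as in (3)
H_eq = equivalent_channel(Hc, weights)         # equivalent channel of (6)
identity_ok = np.allclose(vec(Hc @ Xc), H_eq @ xc)
```

In the noise-free case `vec(Hc @ Xc)` and `H_eq @ xc` agree, which is the content of (4)-(6).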

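Each element of $x_c$ is an $M$-PAM or $M$-QAM symbol. $M$-PAM symbols take discrete values from $\{A_m, m = 1, \cdots, M\}$, where $A_m = (2m - 1 - M)$, and $M$-QAM is nothing but two PAMs in quadrature. Let $y_c$, $\tilde{H}_c$, $x_c$, and $n_c$ be decomposed into real and imaginary parts as

$$y_c = y_I + j y_Q, \quad x_c = x_I + j x_Q, \quad n_c = n_I + j n_Q, \quad \tilde{H}_c = H_I + j H_Q. \tag{7}$$

Further, we define $x_r \in \mathbb{R}^{2k \times 1}$, $y_r \in \mathbb{R}^{2N_r p \times 1}$, $H_r \in \mathbb{R}^{2N_r p \times 2k}$, and $n_r \in \mathbb{R}^{2N_r p \times 1}$ as

$$x_r = [x_I^T \; x_Q^T]^T, \quad y_r = [y_I^T \; y_Q^T]^T, \quad H_r = \begin{bmatrix} H_I & -H_Q \\ H_Q & H_I \end{bmatrix}, \quad n_r = [n_I^T \; n_Q^T]^T. \tag{8}$$

Now, (6) can be written as

$$y_r = H_r x_r + n_r. \tag{9}$$

Henceforth, we work with the real-valued system in (9). For notational simplicity, we drop subscripts $r$ in (9) and write

$$y = Hx + n, \tag{10}$$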



where $H = H_r \in \mathbb{R}^{2N_r p \times 2k}$, $y = y_r \in \mathbb{R}^{2N_r p \times 1}$, $x = x_r \in \mathbb{R}^{2k \times 1}$, and $n = n_r \in \mathbb{R}^{2N_r p \times 1}$. The channel coefficients are assumed to be known only at the receiver but not at the transmitter. Let $\mathbb{A}_i$ denote the $M$-PAM signal set from which $x_i$ ($i$th entry of $x$) takes values, $i = 1, \cdots, 2k$. Now, define a $2k$-dimensional signal space $\mathbb{S}$ to be the Cartesian product of $\mathbb{A}_1$ to $\mathbb{A}_{2k}$. The ML solution is given by

$$d_{ML} = \arg\min_{d \in \mathbb{S}} \|y - Hd\|^2 = \arg\min_{d \in \mathbb{S}} \; d^T H^T H d - 2 y^T H d, \tag{11}$$

whose complexity is exponential in $k$ [25].
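To make the real-valued model and the cost of exact ML detection concrete, here is a small sketch under stated assumptions: `to_real` and `ml_detect` are hypothetical helper names, the real matrix is stacked as $[[H_I, -H_Q],[H_Q, H_I]]$, and the exhaustive search is only usable for toy sizes because it visits $|\mathbb{A}|^{2k}$ candidates.

```python
import itertools
import numpy as np

def to_real(H_c, y_c):
    # Real-valued model: stack real parts over imaginary parts.
    H = np.block([[H_c.real, -H_c.imag], [H_c.imag, H_c.real]])
    y = np.concatenate([y_c.real, y_c.imag])
    return H, y

def ml_detect(H, y, alphabet):
    # Exhaustive minimization of d^T H^T H d - 2 y^T H d over alphabet^(2k).
    best, best_cost = None, np.inf
    for cand in itertools.product(alphabet, repeat=H.shape[1]):
        d = np.array(cand, dtype=float)
        cost = d @ H.T @ H @ d - 2 * y @ H @ d
        if cost < best_cost:
            best, best_cost = d, cost
    return best

rng = np.random.default_rng(1)
k, rows = 3, 6
H_c = (rng.standard_normal((rows, k)) + 1j * rng.standard_normal((rows, k))) / np.sqrt(2)
x_c = rng.choice([-1.0, 1.0], size=k) + 1j * rng.choice([-1.0, 1.0], size=k)
y_c = H_c @ x_c                      # noise-free, so ML must recover x exactly

H, y = to_real(H_c, y_c)
x = np.concatenate([x_c.real, x_c.imag])
d_ml = ml_detect(H, y, alphabet=(-1.0, 1.0))   # 2**6 = 64 candidates here
```

With $k = N_t^2 = 1024$ symbols, as for the 32 × 32 code, the same loop would have to visit an astronomically large candidate set, which is why a low-complexity search is needed.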



A. High-rate Non-orthogonal STBCs from CDA 

We focus on the detection of square (i.e., $n = p = N_t$), full-rate (i.e., $k = pn = N_t^2$), circulant (where the weight matrices $A_c^{(i)}$'s are permutation type), non-orthogonal STBCs from CDA [26], whose construction for arbitrary number of transmit antennas $n$ is given by the matrix in (11.a) [7].

In (11.a), $\omega_n = e^{j 2\pi / n}$, $j = \sqrt{-1}$, and $x_{u,v}$, $0 \leq u, v \leq n-1$ are the data symbols from a QAM alphabet. When $\delta = e^{\sqrt{5}\, j}$ and $t = e^{j}$, the STBC in (11.a) achieves full transmit diversity (under ML decoding) as well as information-losslessness [7]. When $\delta = t = 1$, the code ceases to be of full-diversity (FD), but continues to be information-lossless (ILL) [27], [52].
High spectral efficiencies with large $n$ can be achieved using this code construction. For example, with $n = 32$ transmit antennas, the 32 × 32 STBC from (11.a) with 16-QAM and rate-3/4 turbo code achieves a spectral efficiency of 96 bps/Hz. This high spectral efficiency is achieved along with the full-diversity of order $n N_r$. However, since these STBCs are non-orthogonal, ML detection gets increasingly impractical for large $n$. Consequently, a key challenge in realizing the benefits of these large STBCs in practice is that of achieving near-ML performance for large $n$ at low detection complexities. Our proposed detector, termed the multistage likelihood ascent search (M-LAS) detector, presented in the following section, addresses this challenging issue.
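The code construction and the spectral-efficiency arithmetic can be illustrated with a short sketch. It assumes the circulant layout in which entry $(r, c)$ of the codeword is $\sum_i x_{(r-c) \bmod n,\, i}\, \omega_n^{ci} t^i$, multiplied by $\delta$ above the diagonal; the function names and the unit-modulus values passed for `delta` and `t` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cda_stbc(x, delta=1.0, t=1.0):
    # x[u, i] holds the n*n data symbols; returns the n x n codeword.
    n = x.shape[0]
    omega = np.exp(2j * np.pi / n)
    i = np.arange(n)
    X = np.zeros((n, n), dtype=complex)
    for r in range(n):
        for c in range(n):
            u = (r - c) % n
            scale = delta if c > r else 1.0   # delta only above the diagonal
            X[r, c] = scale * np.sum(x[u] * omega ** (c * i) * t ** i)
    return X

def weight_vectors(n, delta=1.0, t=1.0):
    # Column m is vec(A^(m)): the codeword generated by a single unit symbol.
    cols = []
    for m in range(n * n):
        x = np.zeros(n * n, dtype=complex)
        x[m] = 1.0
        cols.append(cda_stbc(x.reshape(n, n), delta, t).reshape(-1, order="F"))
    return np.column_stack(cols)

n = 4
V = weight_vectors(n, delta=np.exp(1j * np.sqrt(5)), t=np.exp(1j))
gram = V.conj().T @ V                 # equals n * I for unit-modulus delta and t

# 32 complex symbols per channel use, 4 bits per 16-QAM symbol, rate-3/4 code.
spectral_efficiency = 32 * np.log2(16) * 3 / 4
```

The Gram matrix being a multiple of the identity reflects the information-lossless property, and the last line reproduces the 96 bps/Hz figure quoted above.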

III. Proposed Multistage LAS Detector 

The proposed M-LAS algorithm consists of a sequence of
likelihood-ascent search stages, where the likelihood increases 
monotonically with every search stage. Each search stage 
consists of several sub-stages. There can be at most M sub- 
stages, each consisting of one or more iterations (the first sub- 
stage can have one or more iterations, whereas all the other 
sub-stages can have at most one iteration). In the first sub- 
stage, the algorithm updates one symbol per iteration such 
that the likelihood monotonically increases from one iteration 
to the next until a local minima is reached. Upon reaching this 
local minima, the algorithm initiates the second sub-stage. 



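$$
\begin{bmatrix}
\sum_{i=0}^{n-1} x_{0,i}\, t^i & \delta \sum_{i=0}^{n-1} x_{n-1,i}\, \omega_n^{i} t^i & \delta \sum_{i=0}^{n-1} x_{n-2,i}\, \omega_n^{2i} t^i & \cdots & \delta \sum_{i=0}^{n-1} x_{1,i}\, \omega_n^{(n-1)i} t^i \\
\sum_{i=0}^{n-1} x_{1,i}\, t^i & \sum_{i=0}^{n-1} x_{0,i}\, \omega_n^{i} t^i & \delta \sum_{i=0}^{n-1} x_{n-1,i}\, \omega_n^{2i} t^i & \cdots & \delta \sum_{i=0}^{n-1} x_{2,i}\, \omega_n^{(n-1)i} t^i \\
\sum_{i=0}^{n-1} x_{2,i}\, t^i & \sum_{i=0}^{n-1} x_{1,i}\, \omega_n^{i} t^i & \sum_{i=0}^{n-1} x_{0,i}\, \omega_n^{2i} t^i & \cdots & \delta \sum_{i=0}^{n-1} x_{3,i}\, \omega_n^{(n-1)i} t^i \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\sum_{i=0}^{n-1} x_{n-2,i}\, t^i & \sum_{i=0}^{n-1} x_{n-3,i}\, \omega_n^{i} t^i & \sum_{i=0}^{n-1} x_{n-4,i}\, \omega_n^{2i} t^i & \cdots & \delta \sum_{i=0}^{n-1} x_{n-1,i}\, \omega_n^{(n-1)i} t^i \\
\sum_{i=0}^{n-1} x_{n-1,i}\, t^i & \sum_{i=0}^{n-1} x_{n-2,i}\, \omega_n^{i} t^i & \sum_{i=0}^{n-1} x_{n-3,i}\, \omega_n^{2i} t^i & \cdots & \sum_{i=0}^{n-1} x_{0,i}\, \omega_n^{(n-1)i} t^i
\end{bmatrix} \tag{11.a}
$$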
In the second sub-stage, a 2-symbol update is tried to further increase the likelihood. If the algorithm succeeds in increasing the likelihood by 2-symbol update, it starts the next search stage. If the algorithm does not succeed in the second sub-stage, it goes to the third sub-stage where a 3-symbol update is tried to further increase the likelihood. Essentially, in the $K$th sub-stage, a $K$-symbol update is tried to further increase the likelihood. This goes on until a) either the algorithm succeeds in the $K$th sub-stage for some $K \leq M$ (in which case a new search stage is initiated), or b) the algorithm terminates.

The M-LAS algorithm starts with an initial solution $d^{(0)}$, given by $d^{(0)} = By$, where $B$ is the initial solution filter, which can be a matched filter (MF) or zero-forcing (ZF) filter or MMSE filter. The index $m$ in $d^{(m)}$ denotes the iteration number in a sub-stage of a given search stage. The ML cost function after the $k$th iteration in a given search stage is

$$C^{(k)} = d^{(k)T} H^T H d^{(k)} - 2 y^T H d^{(k)}. \tag{12}$$



A. One-symbol Update 

Let us assume that we update the $p$th symbol in the $(k+1)$th iteration; $p$ can take value from $1, \cdots, N_t^2$ for $M$-PAM and $1, \cdots, 2N_t^2$ for $M$-QAM. The update rule can be written as

$$d^{(k+1)} = d^{(k)} + \lambda_p^{(k)} e_p, \tag{13}$$

where $e_p$ denotes the unit vector with its $p$th entry only as one, and all other entries as zero. Also, for any iteration $k$, $d^{(k)}$ should belong to the space $\mathbb{S}$, and therefore $\lambda_p^{(k)}$ can take only certain integer values. For example, in case of 4-PAM or 16-QAM (both have the same signal set $\mathbb{A}_p = \{-3, -1, 1, 3\}$), $\lambda_p^{(k)}$ can take values only from $\{-6, -4, -2, 0, 2, 4, 6\}$. Using (12) and (13), and defining a matrix $G$ as

$$G \triangleq H^T H, \tag{14}$$

we can write the cost difference as

$$\Delta C_p^{k+1} \triangleq C^{(k+1)} - C^{(k)} = \lambda_p^{(k)2} (G)_{p,p} - 2 \lambda_p^{(k)} z_p^{(k)}, \tag{15}$$

where $h_p$ is the $p$th column of $H$, $z^{(k)} = H^T (y - H d^{(k)})$, $z_p^{(k)}$ is the $p$th entry of the $z^{(k)}$ vector, and $(G)_{p,p}$ is the $(p,p)$th entry of the $G$ matrix. Also, let us define $a_p$ and $l_p^{(k)}$ as

$$a_p = (G)_{p,p}, \quad l_p^{(k)} = |\lambda_p^{(k)}|. \tag{16}$$

With the above variables defined, we can rewrite (15) as

$$\Delta C_p^{k+1} = l_p^{(k)2} a_p - 2 l_p^{(k)} |z_p^{(k)}| \, \text{sgn}(\lambda_p^{(k)}) \, \text{sgn}(z_p^{(k)}), \tag{17}$$

where sgn(.) denotes the signum function. For the ML cost function to reduce from the $k$th to the $(k+1)$th iteration, the cost difference should be negative. Using this fact and that $a_p$ and $l_p^{(k)}$ are non-negative quantities, we can conclude from (17) that the sign of $\lambda_p^{(k)}$ must satisfy

$$\text{sgn}(\lambda_p^{(k)}) = \text{sgn}(z_p^{(k)}). \tag{18}$$

Using (18) in (17), the ML cost difference can be rewritten as

$$\mathcal{F}(l_p^{(k)}) \triangleq \Delta C_p^{k+1} = l_p^{(k)2} a_p - 2 l_p^{(k)} |z_p^{(k)}|. \tag{19}$$

For $\mathcal{F}(l_p^{(k)})$ to be non-positive, the necessary and sufficient condition from (19) is that

$$l_p^{(k)} \leq \frac{2 |z_p^{(k)}|}{a_p}. \tag{20}$$

However, we can find the value of $l_p^{(k)}$ which satisfies (20) and at the same time gives the largest descent in the ML cost function from the $k$th to the $(k+1)$th iteration (when symbol $p$ is updated). Also, $l_p^{(k)}$ is constrained to take only certain integer values, and therefore the brute-force way to get optimum $l_p^{(k)}$ is to evaluate $\mathcal{F}(l_p^{(k)})$ at all possible values of $l_p^{(k)}$. This would become computationally expensive as the constellation size $M$ increases. However, for the case of 1-symbol update, we could obtain a closed-form expression for the optimum $l_p^{(k)}$ that minimizes $\mathcal{F}(l_p^{(k)})$, which is given by (corresponding theorem and proof are given in the Appendix)

$$l_{p,opt}^{(k)} = 2 \left\lfloor \frac{|z_p^{(k)}|}{2 a_p} \right\rceil, \tag{21}$$

where $\lfloor \cdot \rceil$ denotes the rounding operation, where for a real number $x$, $\lfloor x \rceil$ is the integer closest to $x$. If the $p$th symbol in $d^{(k)}$, i.e., $d_p^{(k)}$, were indeed updated, then the new value of the symbol would be given by

$$\tilde{d}_p^{(k+1)} = d_p^{(k)} + l_{p,opt}^{(k)} \, \text{sgn}(z_p^{(k)}). \tag{22}$$

However, $\tilde{d}_p^{(k+1)}$ can take values only in the set $\mathbb{A}_p$, and therefore we need to check for the possibility of $\tilde{d}_p^{(k+1)}$ being greater than $(M-1)$ or less than $-(M-1)$. If $\tilde{d}_p^{(k+1)} > (M-1)$, then $l_{p,opt}^{(k)}$ is adjusted so that the new value of $\tilde{d}_p^{(k+1)}$ with the adjusted value of $l_{p,opt}^{(k)}$ using (22) is $(M-1)$. Similarly, if $\tilde{d}_p^{(k+1)} < -(M-1)$, then $l_{p,opt}^{(k)}$ is adjusted so that the new value of $\tilde{d}_p^{(k+1)}$ is $-(M-1)$. Let $\tilde{l}_{p,opt}^{(k)}$ be obtained from $l_{p,opt}^{(k)}$ after these adjustments. It can be shown that if $\mathcal{F}(l_{p,opt}^{(k)})$ is non-positive, then $\mathcal{F}(\tilde{l}_{p,opt}^{(k)})$ is also non-positive. We compute $\mathcal{F}(\tilde{l}_{p,opt}^{(k)})$, $\forall p = 1, \cdots, 2N_t^2$. Now, let

$$s = \arg\min_{p} \mathcal{F}(\tilde{l}_{p,opt}^{(k)}). \tag{23}$$

If $\mathcal{F}(\tilde{l}_{s,opt}^{(k)}) < 0$, the update for the $(k+1)$th iteration is

$$d^{(k+1)} = d^{(k)} + \tilde{l}_{s,opt}^{(k)} \, \text{sgn}(z_s^{(k)}) \, e_s, \tag{24}$$

$$z^{(k+1)} = z^{(k)} - \tilde{l}_{s,opt}^{(k)} \, \text{sgn}(z_s^{(k)}) \, g_s, \tag{25}$$

where $g_s$ is the $s$th column of $G$. The update in (25) follows from the definition of $z^{(k)}$ in (15). If $\mathcal{F}(\tilde{l}_{s,opt}^{(k)}) \geq 0$, then the 1-symbol update search terminates. The data vector at this point is referred to as '1-symbol update local minima.' After reaching the 1-symbol update local minima, we look for a further decrease in the cost function by updating multiple symbols simultaneously.

B. Why Multiple Symbol Updates?

The motivation for trying out multiple symbol updates can be explained as follows. Let $\mathbb{L}_K \subseteq \mathbb{S}$ denote the set of data vectors such that for any $d \in \mathbb{L}_K$, if a $K$-symbol update is performed on $d$ resulting in a vector $d'$, then $\|y - Hd'\| \geq \|y - Hd\|$. We note that $d_{ML} \in \mathbb{L}_K$, $\forall K = 1, 2, \cdots, 2N_t^2$, because any number of symbol updates on $d_{ML}$ will not decrease the cost function. We define another set $\mathbb{M}_K = \bigcap_{j=1}^{K} \mathbb{L}_j$. Note that $d_{ML} \in \mathbb{M}_K$, $\forall K = 1, 2, \cdots, 2N_t^2$, and $\mathbb{M}_{2N_t^2} = \{d_{ML}\}$, i.e., $\mathbb{M}_{2N_t^2}$ is a singleton set with $d_{ML}$ as the only element. It is noted that if the updates are done optimally, then the output of the $K$-LAS algorithm converges to a vector in $\mathbb{M}_K$. Also, $|\mathbb{M}_{K+1}| \leq |\mathbb{M}_K|$, $K = 1, 2, \cdots, 2N_t^2 - 1$. For any $d \in \mathbb{M}_K$, $K = 1, 2, \cdots, 2N_t^2$ and $d \neq d_{ML}$, it can be seen that $d$ and $d_{ML}$ will differ in $K+1$ or more locations. The probability that $d_{ML} = x$ increases with increasing SNR, and so the separation between $d \in \mathbb{M}_K$ and $x$ will monotonically increase with increasing $K$. Since $d_{ML} \in \mathbb{M}_K$, and $|\mathbb{M}_K|$ decreases monotonically with increasing $K$, there will be fewer non-ML data vectors to which the algorithm can converge for increasing $K$. Therefore, the probability of the noise vector $n$ inducing an error would decrease with increasing $K$. This indicates that $K$-symbol updates with large $K$ could get near to ML performance with increasing complexity for increasing $K$.
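The one-symbol update stage of Section III-A, whose local minima motivate the multi-symbol updates discussed above, can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the starting vector is an arbitrary point of the $M$-PAM grid rather than a quantized MF/ZF/MMSE output, and the step size is the nearest even integer to $|z_p|/a_p$, clipped so that the symbol stays in the alphabet.

```python
import numpy as np

def ml_cost(H, y, d):
    # ML cost d^T H^T H d - 2 y^T H d.
    return d @ H.T @ H @ d - 2 * y @ H @ d

def one_symbol_las(H, y, d0, M):
    # d0 must lie on the M-PAM grid {-(M-1), ..., -1, 1, ..., M-1}.
    G = H.T @ H
    a = np.diag(G).copy()
    d = d0.astype(float).copy()
    z = H.T @ (y - H @ d)
    while True:
        s = np.sign(z)
        l = 2.0 * np.rint(np.abs(z) / (2.0 * a))      # nearest even step size
        d_try = np.clip(d + l * s, -(M - 1), M - 1)    # keep symbols in the alphabet
        l = np.abs(d_try - d)                          # adjusted step sizes
        F = l * l * a - 2.0 * l * np.abs(z)            # cost change per symbol
        p = int(np.argmin(F))
        if F[p] >= 0:
            return d                                   # 1-symbol local minimum
        step = l[p] * s[p]
        d[p] += step                                   # symbol update
        z -= step * G[:, p]                            # keep z = H^T (y - H d)

rng = np.random.default_rng(2)
M, n_sym, n_obs = 4, 8, 12
alphabet = [-3.0, -1.0, 1.0, 3.0]
H = rng.standard_normal((n_obs, n_sym))
x = rng.choice(alphabet, size=n_sym)
y = H @ x + 0.3 * rng.standard_normal(n_obs)
d0 = np.ones(n_sym)            # arbitrary valid starting point (assumption)
d_las = one_symbol_las(H, y, d0, M)
```

On termination, changing any single entry of `d_las` to another alphabet value cannot lower the ML cost, which is exactly the local-minimum property that the multi-symbol sub-stages then try to escape.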

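C. K-symbol Update, $1 < K \leq 2N_t^2$

In this subsection, we present the update algorithm for the general case where $K$ symbols, $1 < K \leq 2N_t^2$, are updated simultaneously in one iteration. $K$-symbol updates can be done in $\binom{2N_t^2}{K}$ ways, among which we seek to find that update which gives the largest reduction in the ML cost. Assume that in the $(k+1)$th iteration, $K$ symbols at the indices $i_1, i_2, \cdots, i_K$ of $d^{(k)}$ are updated. Each $i_j$, $j = 1, 2, \cdots, K$, can take values from $1, 2, \cdots, N_t^2$ for $M$-PAM and $1, 2, \cdots, 2N_t^2$ for $M$-QAM. Further, define the set of indices, $\mathcal{U} \triangleq \{i_1, i_2, \cdots, i_K\}$. The update rule for the $K$-symbol update can then be written as

$$d^{(k+1)} = d^{(k)} + \sum_{j=1}^{K} \lambda_{i_j}^{(k)} e_{i_j}. \tag{26}$$

For any iteration $k$, $d^{(k)}$ belongs to the space $\mathbb{S}$, and therefore $\lambda_{i_j}^{(k)}$ can take only certain integer values. In particular, $\lambda_{i_j}^{(k)} \in \mathbb{A}_{i_j}^{(k)}$, where $\mathbb{A}_{i_j}^{(k)} \triangleq \{x \,|\, (x + d_{i_j}^{(k)}) \in \mathbb{A}_{i_j}, x \neq 0\}$. For example, for 16-QAM, $\mathbb{A}_{i_j} = \{-3, -1, 1, 3\}$, and if $d_{i_j}^{(k)}$ is $-1$, then $\mathbb{A}_{i_j}^{(k)} = \{-2, 2, 4\}$. Using (12), we can write the cost difference function $\Delta C_{\mathcal{U}}^{k+1}(\lambda_{i_1}^{(k)}, \lambda_{i_2}^{(k)}, \cdots, \lambda_{i_K}^{(k)}) \triangleq C^{(k+1)} - C^{(k)}$ as

$$\Delta C_{\mathcal{U}}^{k+1}(\lambda_{i_1}^{(k)}, \cdots, \lambda_{i_K}^{(k)}) = \sum_{j=1}^{K} \lambda_{i_j}^{(k)2} (G)_{i_j, i_j} + 2 \sum_{q=1}^{K} \sum_{p=q+1}^{K} \lambda_{i_p}^{(k)} \lambda_{i_q}^{(k)} (G)_{i_p, i_q} - 2 \sum_{j=1}^{K} \lambda_{i_j}^{(k)} z_{i_j}^{(k)}, \tag{27}$$

where $\lambda_{i_j}^{(k)} \in \mathbb{A}_{i_j}^{(k)}$, which can be compactly written as $(\lambda_{i_1}^{(k)}, \lambda_{i_2}^{(k)}, \cdots, \lambda_{i_K}^{(k)}) \in \mathbb{A}_{\mathcal{U}}^{(k)}$, where $\mathbb{A}_{\mathcal{U}}^{(k)}$ denotes the Cartesian product of $\mathbb{A}_{i_1}^{(k)}$, $\mathbb{A}_{i_2}^{(k)}$ through to $\mathbb{A}_{i_K}^{(k)}$.

For a given $\mathcal{U}$, in order to decrease the ML cost, we would like to choose the value of the $K$-tuple $(\lambda_{i_1}^{(k)}, \lambda_{i_2}^{(k)}, \cdots, \lambda_{i_K}^{(k)})$ such that the cost difference given by (27) is negative. If multiple $K$-tuples exist for which the cost difference is negative, we choose the $K$-tuple which gives the most negative cost difference.

Unlike for 1-symbol update, for $K$-symbol update we do not have a closed-form expression for $(\lambda_{i_1,opt}^{(k)}, \lambda_{i_2,opt}^{(k)}, \cdots, \lambda_{i_K,opt}^{(k)})$ which minimizes the cost difference over $\mathbb{A}_{\mathcal{U}}^{(k)}$, since the cost difference is a function of $K$ discrete valued variables. Consequently, a brute-force method is to evaluate $\Delta C_{\mathcal{U}}^{k+1}(\lambda_{i_1}^{(k)}, \lambda_{i_2}^{(k)}, \cdots, \lambda_{i_K}^{(k)})$ over all possible values of $(\lambda_{i_1}^{(k)}, \lambda_{i_2}^{(k)}, \cdots, \lambda_{i_K}^{(k)})$. Approximate methods can be adopted to solve this problem with lower complexity. One method based on zero-forcing is as follows. The cost difference function in (27) can be rewritten as

$$\Delta C_{\mathcal{U}}^{k+1}(\Lambda_{\mathcal{U}}^{(k)}) = \Lambda_{\mathcal{U}}^{(k)T} F_{\mathcal{U}} \Lambda_{\mathcal{U}}^{(k)} - 2 \Lambda_{\mathcal{U}}^{(k)T} z_{\mathcal{U}}^{(k)}, \tag{28}$$

where $\Lambda_{\mathcal{U}}^{(k)} \triangleq [\lambda_{i_1}^{(k)} \, \lambda_{i_2}^{(k)} \cdots \lambda_{i_K}^{(k)}]^T$, $z_{\mathcal{U}}^{(k)} \triangleq [z_{i_1}^{(k)} \, z_{i_2}^{(k)} \cdots z_{i_K}^{(k)}]^T$, and $F_{\mathcal{U}} \in \mathbb{R}^{K \times K}$, where $(F_{\mathcal{U}})_{p,q} = (G)_{i_p, i_q}$, and $p, q \in \{1, 2, \cdots, K\}$. Since $\Delta C_{\mathcal{U}}^{k+1}(\Lambda_{\mathcal{U}}^{(k)})$ is a strictly convex quadratic function of $\Lambda_{\mathcal{U}}^{(k)}$ (the Hessian $F_{\mathcal{U}}$ is positive definite with probability 1), a unique global minima exists, and is given by

$$\tilde{\Lambda}_{\mathcal{U}}^{(k)} = F_{\mathcal{U}}^{-1} z_{\mathcal{U}}^{(k)}. \tag{29}$$

However, the solution given by (29) need not lie in $\mathbb{A}_{\mathcal{U}}^{(k)}$. So, we first round-off the solution as

$$\hat{\Lambda}_{\mathcal{U}}^{(k)} = 2 \left\lfloor 0.5 \, \tilde{\Lambda}_{\mathcal{U}}^{(k)} \right\rceil, \tag{30}$$

where the operation in (30) is done element-wise, since $\tilde{\Lambda}_{\mathcal{U}}^{(k)}$ is a vector. Further, let $\hat{\Lambda}_{\mathcal{U}}^{(k)} \triangleq [\hat{\lambda}_{i_1}^{(k)} \, \hat{\lambda}_{i_2}^{(k)} \cdots \hat{\lambda}_{i_K}^{(k)}]^T$. It is still possible that the solution $\hat{\Lambda}_{\mathcal{U}}^{(k)}$ in (30) need not lie in $\mathbb{A}_{\mathcal{U}}^{(k)}$. This would result in $d_{i_j}^{(k+1)} \notin \mathbb{A}_{i_j}$ for some $j$. For example, if $\mathbb{A}_{i_j}$ is $M$-PAM, then $d_{i_j}^{(k+1)} \notin \mathbb{A}_{i_j}$ if $d_{i_j}^{(k)} + \hat{\lambda}_{i_j}^{(k)} > (M-1)$ or $d_{i_j}^{(k)} + \hat{\lambda}_{i_j}^{(k)} < -(M-1)$. In such cases, we propose the following adjustment to $\hat{\lambda}_{i_j}^{(k)}$ for $j = 1, 2, \cdots, K$:

$$\hat{\lambda}_{i_j}^{(k)} = \begin{cases} (M-1) - d_{i_j}^{(k)}, & \text{when } \hat{\lambda}_{i_j}^{(k)} + d_{i_j}^{(k)} > (M-1) \\ -(M-1) - d_{i_j}^{(k)}, & \text{when } \hat{\lambda}_{i_j}^{(k)} + d_{i_j}^{(k)} < -(M-1). \end{cases} \tag{31}$$

After these adjustments, we are guaranteed that $\hat{\Lambda}_{\mathcal{U}}^{(k)} \in \mathbb{A}_{\mathcal{U}}^{(k)}$. Therefore, the new cost difference function value is given by $\Delta C_{\mathcal{U}}^{k+1}(\hat{\lambda}_{i_1}^{(k)}, \hat{\lambda}_{i_2}^{(k)}, \cdots, \hat{\lambda}_{i_K}^{(k)})$. It is noted that the complexity of this approximate method does not depend on the size of the set $\mathbb{A}_{\mathcal{U}}^{(k)}$, i.e., it has constant complexity. Through simulations, we have observed that this approximation results in a performance close to that of the brute-force method for $K = 2$ and 3. Defining the optimum $\mathcal{U}$ for the approximate method as $\hat{\mathcal{U}}$, we can write

$$\hat{\mathcal{U}} \triangleq (\hat{i}_1, \hat{i}_2, \cdots, \hat{i}_K) = \arg\min_{\mathcal{U}} \Delta C_{\mathcal{U}}^{k+1}(\hat{\lambda}_{i_1}^{(k)}, \hat{\lambda}_{i_2}^{(k)}, \cdots, \hat{\lambda}_{i_K}^{(k)}). \tag{32}$$

The $K$-update is successful and the update is done only if $\Delta C_{\hat{\mathcal{U}}}^{k+1}(\hat{\lambda}_{\hat{i}_1}^{(k)}, \hat{\lambda}_{\hat{i}_2}^{(k)}, \cdots, \hat{\lambda}_{\hat{i}_K}^{(k)}) < 0$. The update rules for the $z^{(k)}$ and $d^{(k)}$ vectors are given by

$$z^{(k+1)} = z^{(k)} - \sum_{j=1}^{K} \hat{\lambda}_{\hat{i}_j}^{(k)} g_{\hat{i}_j}, \tag{33}$$

$$d^{(k+1)} = d^{(k)} + \sum_{j=1}^{K} \hat{\lambda}_{\hat{i}_j}^{(k)} e_{\hat{i}_j}. \tag{34}$$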
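The zero-forcing based $K$-symbol update can be sketched in a few lines. This is an illustrative reading of the steps above (solve, round to the even-integer grid, clip into the alphabet, evaluate the cost change, keep the best index set), not the authors' code; the function names and the deliberately well-conditioned toy channel are assumptions.

```python
import itertools
import numpy as np

def k_symbol_update(G, z, d, M, K):
    # Search all index sets of size K; return (cost change, indices, steps).
    best = (0.0, None, None)
    for U in itertools.combinations(range(len(d)), K):
        idx = list(U)
        F_U = G[np.ix_(idx, idx)]
        lam = np.linalg.solve(F_U, z[idx])                      # unconstrained minimizer
        lam = 2.0 * np.rint(0.5 * lam)                          # round to even integers
        lam = np.clip(d[idx] + lam, -(M - 1), M - 1) - d[idx]   # stay inside M-PAM
        delta_c = lam @ F_U @ lam - 2.0 * lam @ z[idx]          # quadratic cost change
        if delta_c < best[0]:
            best = (delta_c, idx, lam)
    return best

def apply_update(G, z, d, idx, lam):
    # Apply the chosen K-symbol step to copies of d and z.
    d, z = d.copy(), z.copy()
    d[idx] += lam
    z -= G[:, idx] @ lam
    return z, d

rng = np.random.default_rng(3)
n_sym, M, K = 6, 2, 2
H = 3.0 * np.eye(n_sym) + 0.2 * rng.standard_normal((n_sym, n_sym))
x = rng.choice([-1.0, 1.0], size=n_sym)
y = H @ x
d = -x                                   # deliberately poor starting vector
G = H.T @ H
z = H.T @ (y - H @ d)
delta_c, idx, lam = k_symbol_update(G, z, d, M, K)
z_new, d_new = apply_update(G, z, d, idx, lam)
```

Because each candidate set costs a fixed-size linear solve, the work per set does not grow with the constellation size; the number of sets, $\binom{2N_t^2}{K}$, is what dominates.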

D. Computational Complexity of the M-LAS Algorithm 

The complexity of the proposed M-LAS algorithm comprises three components, namely, i) computation of the initial vector $d^{(0)}$, ii) computation of $H^T H$, and iii) the search operation. Figure 1 shows the per-symbol complexity plots as a function of $N_t = N_r$ for 4-QAM at an SNR of 6 dB using MMSE initial vector. Two good properties of the STBCs from CDA are useful in achieving low orders of complexity for the computation of $d^{(0)}$ and $H^T H$. They are: i) the weight matrices $A_c^{(i)}$'s are permutation type, and ii) the $N_t^2 \times N_t^2$ matrix formed with $N_t^2 \times 1$-sized $a_c^{(i)}$ vectors as columns is a scaled unitary matrix. These properties allow the computation of MMSE/ZF initial solution in $O(N_t^3 N_r)$ complexity, i.e., in $O(N_t N_r)$ per-symbol complexity since there are $N_t^2$ symbols in one STBC matrix. Likewise, the computation of $H^T H$ can be done in $O(N_t^3)$ per-symbol complexity.

The average per-symbol complexities of the 1-LAS and 2-LAS search operations are $O(N_t^2)$ and $O(N_t^2 \log N_t)$, respectively, which can be explained as follows. The average search complexity is the complexity of one search stage times the mean number of search stages till the algorithm terminates. For 1-LAS, the number of search stages is always one. There are multiple iterations in the search, and in each iteration all possible $\binom{2N_t^2}{1}$ 1-symbol updates are considered. So, the per-iteration complexity in 1-LAS is $O(N_t^2)$, i.e., $O(1)$ complexity per symbol. Further, the mean number of iterations before the algorithm terminates in 1-LAS was found to be $O(N_t^2)$ through simulations. So, the overall per-symbol complexity of 1-LAS is $O(N_t^2)$. In 2-LAS, the complexity of the 2-symbol update dominates over the 1-symbol update. Since there are $\binom{2N_t^2}{2}$ possible 2-symbol updates, the complexity of one search stage is $O(N_t^4)$, i.e., $O(N_t^2)$ complexity per symbol. The mean number of stages till the algorithm terminates in 2-LAS was found to be $O(\log N_t)$ through simulations. Therefore, the overall per-symbol complexity of 2-LAS is $O(N_t^2 \log N_t)$. These can be observed from Fig. 1, where it can be seen that the per-symbol complexity in the initial vector computation plus the 1-LAS/2-LAS search operation is $O(N_t^2)$/$O(N_t^2 \log N_t)$; i.e., 1-LAS and 2-LAS complexity plots run parallel to the $c_1 N_t^2$ and $c_2 N_t^2 \log N_t$
lines, respectively. With the computation of $H^T H$ included, the complexity order is more than $N_t^2$. From the slopes of the plots in Fig. 1, we find that the overall complexities for $N_t = 16$ and 32 are proportional to $N_t^{2.5}$ and $N_t^{2.7}$, respectively.

Fig. 1. Computational complexity of the proposed M-LAS algorithm in decoding non-orthogonal STBCs from CDA. MMSE initial vector, 4-QAM, SNR = 6 dB.

For the special case of ILL-only STBCs (i.e., $\delta = t = 1$), the complexity involved in computing $d^{(0)}$ and $H^T H$ can be reduced further. This becomes possible due to the following property of ILL-only STBCs. Let $V_a$ be the complex $N_t^2 \times N_t^2$ matrix with $a_c^{(i)}$ as its $i$th column. The computation of $d^{(0)}$ (or $H^T H$) involves multiplication of $V_a^H$ with another vector (or matrix). The columns of $V_a^H$ can be permuted in such a way that the permuted matrix is block-diagonal, where each block is a $N_t \times N_t$ DFT matrix for $\delta = t = 1$. So, the multiplication of $V_a^H$ by any vector becomes equivalent to a $N_t$-point DFT operation, which can be efficiently computed using FFT in $O(N_t \log N_t)$ complexity. Using this simplification, the per-symbol complexity of computing $H^T H$ is reduced from $O(N_t^3)$ to $O(N_t^2 \log N_t)$.
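Computing $d^{(0)}$ using MMSE filter involves the computation of $V_a^H \left(I \otimes \left(\left(H_c^H H_c + \frac{N_t}{\gamma} I\right)^{-1} H_c^H\right)\right) y_c$. The complexity of computing the vector $\left(I \otimes \left(\left(H_c^H H_c + \frac{N_t}{\gamma} I\right)^{-1} H_c^H\right)\right) y_c$ is $O(N_t^2 N_r)$, and the complexity of computing $V_a^H \left(I \otimes \left(\left(H_c^H H_c + \frac{N_t}{\gamma} I\right)^{-1} H_c^H\right)\right) y_c$ is $O(N_t^3 N_r)$. In the case of ILL-only STBC, because of the above-mentioned property, the complexity of computing $V_a^H \left(I \otimes \left(\left(H_c^H H_c + \frac{N_t}{\gamma} I\right)^{-1} H_c^H\right)\right) y_c$ gets reduced to $O(N_t^2 \log N_t)$ from $O(N_t^3 N_r)$. So the total complexity for computing $d^{(0)}$ in ILL-only STBC is $O(N_t^2 N_r) + O(N_t^2 \log N_t)$, which gives a per-symbol complexity of $O(N_r) + O(\log N_t)$. So, the overall per-symbol complexity for 1-LAS detection of ILL-STBCs is $O(N_t^2 \log N_t)$.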
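The saving from the block-diagonal DFT structure can be illustrated with NumPy's FFT. The sketch below assumes the matrix has already been permuted into block-diagonal form with identical DFT blocks, and it uses NumPy's sign convention $e^{-j 2\pi km/n}$; with the conjugate convention the same idea applies with `np.fft.ifft` and a scale factor of $n$. The helper names are hypothetical.

```python
import numpy as np

def dft_matrix(n):
    # n x n DFT matrix in numpy.fft's convention: entries exp(-2j*pi*k*m/n).
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

def blockdiag_dft_multiply(v, n):
    # Multiply a length n*n vector by blockdiag(W, ..., W) using n FFTs of size n,
    # i.e. O(n^2 log n) operations instead of a dense O(n^4) product.
    return np.fft.fft(v.reshape(n, n), axis=1).reshape(-1)

n = 8
rng = np.random.default_rng(5)
v = rng.standard_normal(n * n) + 1j * rng.standard_normal(n * n)

dense = np.kron(np.eye(n), dft_matrix(n)) @ v   # explicit block-diagonal product
fast = blockdiag_dft_multiply(v, n)             # FFT-based equivalent
```

Both results agree to machine precision; only the operation count differs.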

E. Generation of Soft Outputs 

We propose to generate soft values at the M-LAS output for all the individual bits that constitute the $M$-PAM/$M$-QAM symbols as follows. These output values are fed as soft inputs to the decoder in a coded system. Let $d = [\hat{x}_1, \hat{x}_2, \cdots, \hat{x}_{2N_t^2}]$, $\hat{x}_i \in \mathbb{A}_i$ denote the detected output symbol vector from the M-LAS algorithm. Let the symbol $\hat{x}_i$ map to the bit vector $b_i = [b_{i,1}, b_{i,2}, \cdots, b_{i,K_i}]^T$, where $K_i = \log_2 |\mathbb{A}_i|$, and $b_{i,j} \in \{+1, -1\}$, $i = 1, 2, \cdots, 2N_t^2$ and $j = 1, 2, \cdots, K_i$. Let $\tilde{b}_{i,j} \in \mathbb{R}$ denote the soft value for the $j$th bit of the $i$th symbol. Given $d$, we need to find $\tilde{b}_{i,j}$, $\forall (i,j)$.
Note that the quantity $\|y - Hd\|^2$ is inversely related to the likelihood that $d$ is indeed the transmitted symbol vector. Let the $d$ vector with its $j$th bit of the $i$th symbol forced to $+1$ be denoted as vector $d_i^{j+}$. Likewise, let $d_i^{j-}$ be the vector $d$ with its $j$th bit of the $i$th symbol forced to $-1$. Then the quantities $\|y - Hd_i^{j+}\|^2$ and $\|y - Hd_i^{j-}\|^2$ are inversely related to the likelihoods that the $j$th bit of the $i$th transmitted symbol is $+1$ and $-1$, respectively. So, if $\|y - Hd_i^{j-}\|^2 - \|y - Hd_i^{j+}\|^2$ is positive (or negative), it indicates that the $j$th bit of the $i$th transmitted symbol has a higher likelihood of being $+1$ (or $-1$). So, the quantity $\|y - Hd_i^{j-}\|^2 - \|y - Hd_i^{j+}\|^2$, appropriately normalized to avoid unbounded increase for increasing $N_t$, can be a good soft value for the $j$th bit of the $i$th symbol. With this motivation, we generate the soft output value for the $j$th bit of the $i$th symbol as

$$\tilde{b}_{i,j} = \frac{\|y - Hd_i^{j-}\|^2 - \|y - Hd_i^{j+}\|^2}{\|h_i\|^2}, \tag{35}$$

where the normalization by $\|h_i\|^2$ is to contain unbounded increase of $\tilde{b}_{i,j}$ for increasing $N_t$. The RHS in the above can be efficiently computed in terms of $z$ and $G$ as follows. Since $d_i^{j+}$ and $d_i^{j-}$ differ only in the $i$th entry, we can write

$$d_i^{j-} = d_i^{j+} + \lambda_{i,j} e_i. \tag{36}$$

Since we know $d_i^{j-}$ and $d_i^{j+}$, we know $\lambda_{i,j}$ from (36). Substituting (36) in (35), we can write

$$\tilde{b}_{i,j} \|h_i\|^2 = \|y - Hd_i^{j+} - \lambda_{i,j} h_i\|^2 - \|y - Hd_i^{j+}\|^2$$

$$= \lambda_{i,j}^2 \|h_i\|^2 - 2 \lambda_{i,j} h_i^T (y - Hd_i^{j+}) \tag{37}$$

$$= -\lambda_{i,j}^2 \|h_i\|^2 - 2 \lambda_{i,j} h_i^T (y - Hd_i^{j-}). \tag{38}$$

If $b_{i,j} = 1$, then $d_i^{j+} = d$ and substituting this in (37) and dividing by $\|h_i\|^2$, we get

$$\tilde{b}_{i,j} = \lambda_{i,j}^2 - 2 \lambda_{i,j} \frac{z_i}{(G)_{i,i}}. \tag{39}$$

If $b_{i,j} = -1$, then $d_i^{j-} = d$ and substituting this in (38) and dividing by $\|h_i\|^2$, we get

$$\tilde{b}_{i,j} = -\lambda_{i,j}^2 - 2 \lambda_{i,j} \frac{z_i}{(G)_{i,i}}. \tag{40}$$

It is noted that $\mathbf{z}$ and $\mathbf{G}$ are already available upon the termination of the M-LAS algorithm, and hence the complexity of computing $\tilde{b}_{i,j}$ in (39) and (40) is constant. Hence, with $2N_t^2$ real symbols per STBC matrix, the overall complexity in computing the soft values for all the bits is $O(N_t^2 \log_2 M)$. We also see from (39) and (40) that the magnitude of $\tilde{b}_{i,j}$ depends upon $\lambda_{i,j}$. For large-size signal sets, the possible values of $\lambda_{i,j}$ will also be large in magnitude. We therefore have to normalize $\tilde{b}_{i,j}$ for the turbo decoder to function properly. It has been observed through simulations that normalizing $\tilde{b}_{i,j}$ by $(\lambda_{i,j}/2)^2$ resulted in good performance. In [28], we have shown that this soft decision output generation method, when used in large V-BLAST systems, offers about 1 to 1.5 dB improvement in coded BER performance compared to that achieved using hard decision outputs from the M-LAS algorithm. We have observed similar improvements in STBC MIMO systems also. In all coded BER simulations in this paper, we use the soft outputs proposed here as inputs to the decoder.
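The soft-value computation above can be checked numerically on a small real-valued system. The following is a minimal sketch, not the authors' implementation: it assumes that $\mathbf{z} = \mathbf{H}^T(\mathbf{y} - \mathbf{H}\mathbf{d})$ and $\mathbf{G} = \mathbf{H}^T\mathbf{H}$, restricts each real symbol to $\{+1, -1\}$ (one bit per real dimension, as in 4-QAM, so that $\lambda_{i,j} = -2$), and recomputes $z_i$ and $(\mathbf{G})_{i,i}$ from scratch rather than reusing values left over from the search. The function names are hypothetical.

```python
def soft_bits_from_z_and_g(H, y, d):
    """Soft values via (39)/(40) for a detected vector d with entries in {+1, -1}."""
    rows, cols = len(H), len(d)
    # Residual y - H d for the detected vector.
    residual = [y[k] - sum(H[k][i] * d[i] for i in range(cols)) for k in range(rows)]
    lam = -2.0  # d^{j-} = d^{j+} + lam * e_i when the two candidate symbols are +1 and -1
    soft = []
    for i in range(cols):
        z_i = sum(H[k][i] * residual[k] for k in range(rows))  # i-th entry of H^T (y - H d)
        g_ii = sum(H[k][i] ** 2 for k in range(rows))           # (H^T H)_{i,i} = ||h_i||^2
        sign = 1.0 if d[i] > 0 else -1.0                        # +lam^2 in (39), -lam^2 in (40)
        soft.append(sign * lam * lam - 2.0 * lam * z_i / g_ii)
    return soft


def soft_bits_direct(H, y, d):
    """Reference computation: evaluate (35) by forcing each bit to +1 and to -1."""
    rows, cols = len(H), len(d)

    def squared_distance(vec):
        return sum(
            (y[k] - sum(H[k][i] * vec[i] for i in range(cols))) ** 2
            for k in range(rows)
        )

    soft = []
    for i in range(cols):
        d_plus = list(d)
        d_plus[i] = 1.0
        d_minus = list(d)
        d_minus[i] = -1.0
        h_norm_sq = sum(H[k][i] ** 2 for k in range(rows))
        soft.append((squared_distance(d_minus) - squared_distance(d_plus)) / h_norm_sq)
    return soft


# Toy 3x2 real system (illustrative numbers only).
H_example = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.4]]
y_example = [0.9, -1.1, 0.2]
d_example = [1.0, -1.0]
fast = soft_bits_from_z_and_g(H_example, y_example, d_example)
slow = soft_bits_direct(H_example, y_example, d_example)
print(fast, slow)
```

The two functions agree up to rounding. The first needs only one inner product per symbol once the residual is known, which is the reason the per-bit cost is constant when $\mathbf{z}$ and $\mathbf{G}$ are already available.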

IV. BER Performance with Perfect CSIR 

In this section, we present the uncoded/turbo coded BER performance of the proposed M-LAS detector in decoding non-orthogonal STBCs from CDA, assuming perfect knowledge of CSI at the receiver. In all the BER simulations in this section, we have assumed that the fade remains constant over one STBC matrix duration and varies i.i.d. from one STBC matrix duration to the other. We consider two STBC designs: i) 'FD-ILL' STBCs where δ = e^{√5 j}, t = e^{j} in (11.a), and ii) 'ILL-only' STBCs where δ = t = 1. The SNRs in all the BER performance figures are the average received SNR per receive antenna, γ, defined in Sec. II. We have used the MMSE filter as the initial filter in all the simulations.

A. Uncoded BER as a Function of Increasing Nt = Nr 

In Fig. 2, we plot the uncoded BER performance of the proposed 1-, 2-, and 3-LAS algorithms in decoding ILL-only STBCs (4 x 4, 8 x 8, 16 x 16, 32 x 32 STBCs) for Nt = Nr = 4, 8, 16, 32 and 4-QAM. SISO AWGN
performance (without fading) and MMSE-only performance 
(i.e., without the search using LAS) are also plotted for com- 
parison. It can be seen that MMSE-only performance does not 
improve with increasing STBC size (i.e., increasing Nt = Nr). 
However, it is interesting to see that, when the proposed search 
using LAS is performed following the MMSE operation, the 
performance improves for increasing Nt = Nr, illustrating 
the performance benefit due to the proposed search strategy. 
For example, though the LAS detector performs far from 
SISO AWGN performance for small number of dimensions 
(e.g., 4 X 4,8 x 8 STBCs with 32 and 128 real dimensions, 
respectively), its large system behavior at increased number of 
dimensions (e.g., 16 x 16 and 32 x 32 STBCs with 512 and 
2048 real dimensions, respectively) effectively renders near 
SISO AWGN performance; e.g., with Nt = Nr = 16, 32, 
for BERs better than lO^'^, the LAS detector performs very 
close to SISO AWGN performance. We also observe that 3-LAS performs better than 2-LAS for Nt = Nr = 4, 8, and 2-LAS performs better than 1-LAS. Since close to SISO AWGN
performance is achieved with 1-, 2-, or 3-symbol update itself, 
the cases of more than 3-symbol update, which will result in 
increased complexity with diminishing returns in performance 
gain, are not considered in the performance evaluation. 

B. Performance of FD-ILL Versus ILL-only STBCs 

In Fig. 3, we present an uncoded BER performance comparison between FD-ILL and ILL-only STBCs for 4-QAM at different Nt = Nr using 1-LAS detection. The BER plots in Fig. 3 illustrate that the performance of ILL-only STBCs with 1-LAS detection for Nt = Nr = 4, 8, 16, 32 and 4-QAM is almost as good as that of the corresponding FD-ILL STBCs. A similar closeness between the performance of ILL-only and FD-ILL
'We will relax this perfect channel knowledge assumption in the next 
section, where we present an iterative detection/channel estimation scheme 
for the considered large STBC MIMO system. 
Fig. 2. Uncoded BER of the proposed 1-LAS, 2-LAS and 3-LAS detectors for ILL-only STBCs for different Nt = Nr. 4-QAM, 2Nt bps/Hz. BER improves as Nt = Nr increases and approaches SISO AWGN performance for large Nt = Nr.

Fig. 3. Uncoded BER comparison between FD-ILL and ILL-only STBCs for different Nt = Nr. 4-QAM, 2Nt bps/Hz, 1-LAS detection. ILL-only STBCs perform almost the same as FD-ILL STBCs.

Fig. 4. Uncoded BER comparison between perfect codes and ILL-only STBCs for different Nt = Nr, 4-QAM, 2Nt bps/Hz, 1-LAS detection. For small dimensions (e.g., 4 x 4, 6 x 6, 8 x 8), perfect codes with 1-LAS detection perform worse than ILL-only STBCs. For large dimensions (e.g., 16 x 16, 32 x 32), ILL-only STBCs and perfect codes perform almost the same.



STBCs is observed in the turbo coded BER performance as well, which is shown in Fig. 8 for a 16 x 16 STBC with 4-QAM and turbo code rates of 1/3, 1/2 and 3/4. This is an interesting observation, since it suggests that, in such cases, the computational complexity advantage with δ = t = 1 in ILL-only STBCs can be exploited without incurring much performance loss compared to FD-ILL STBCs.



C. Decoding and BER of Perfect Codes of Large Dimensions 

While the STBC design in (11.a) offers both ILL and FD, perfect codes under ML decoding can provide coding gain in addition to ILL and FD [17]-[21]. Decoding of perfect codes has been reported in the literature for only up to 5 antennas using sphere/lattice decoding [20]. The complexity of these decoders is prohibitive for decoding large-sized perfect codes, although large-sized codes are of interest from a high spectral efficiency viewpoint. We note that, because of its low-complexity attribute, the proposed M-LAS detector is able to decode perfect codes of large dimensions. In Figs. 4 and 5, we present the simulated BER performance of perfect codes in comparison with those of ILL-only and FD-ILL STBCs for up to 32 transmit antennas using the 1-LAS detector.

In Fig. 4, we show uncoded BER comparison between perfect codes and ILL-only STBCs for different Nt = Nr and 4-QAM using 1-LAS detection. The 4 x 4 and 6 x 6 perfect codes are from [19], and the 8 x 8, 16 x 16 and 32 x 32 perfect codes are from [20]. From Fig. 4, it can be seen that the 1-LAS detector achieves better performance for ILL-only STBCs than for perfect codes, when codes with small number of transmit antennas are considered (e.g., Nt = 4, 6, 8). While perfect codes are expected to perform better than ILL-only codes under ML detection for any Nt, we observe the opposite behavior under 1-LAS detection for small Nt (i.e., ILL-only STBCs performing better than perfect codes for small dimensions). This behavior could be attributed to the nature of the LAS detector, which achieves near-optimal performance only when the number of dimensions is

*We note that the definition of perfect codes differs in [19] and [20]. The perfect codes covered by the definition in [20] include the perfect codes of [19] as a proper subclass. However, for our purpose of illustrating the performance of the proposed detector in large STBC MIMO systems, we refer to the codes in [19] as well as [20] as perfect codes.
Fig. 5. Uncoded BER comparison between perfect codes, ILL-only, and FD-ILL STBCs for Nt = Nr = 16, 32, 16-QAM, 4Nt bps/Hz, 1-LAS detection. For larger modulation alphabet sizes (e.g., 16-QAM), perfect codes with 1-LAS detection perform poorer than ILL-only and FD-ILL STBCs.



Fig. 6. Uncoded BER comparison between the proposed 1-LAS algorithm and the ISIC algorithm in [30] for ILL-only STBCs for different Nt = Nr. 4-QAM, 2Nt bps/Hz. MMSE initial vectors for both 1-LAS and ISIC. 1-LAS performs significantly better than ISIC in [30].



large, and it appears that, in the detection process, LAS is more effective in disentangling the symbols in STBCs when δ = t = 1 (i.e., in ILL-only STBCs) than in perfect codes. The performance gap between perfect codes and ILL-only STBCs with 1-LAS detection diminishes for increasing code sizes such that the performance for 32 x 32 perfect code and ILL-only STBC with 4-QAM are almost the same and close to the SISO AWGN performance. In Fig. 5, we show a similar comparison between perfect codes, ILL-only and FD-ILL STBCs when larger modulation alphabet sizes (e.g., 16-QAM) are used in the case of 16 x 16 and 32 x 32 codes. It can be seen that with higher-order QAM like 16-QAM, perfect codes with 1-LAS detection perform poorer than ILL-only and FD-ILL STBCs, and that ILL-only and FD-ILL STBCs perform almost the same and close to the SISO AWGN performance. The results in Figs. 4 and 5 suggest that, with 1-LAS detection, owing to the complexity advantage and good performance in using δ = t = 1, ILL-only STBCs can be a good choice for practical large STBC MIMO systems [27], [52].

D. Comparison with Other Large-MIMO Architecture/Detec- 
tor Combinations 

In [30], Choi et al. have presented an iterative soft interference cancellation (ISIC) scheme for multiple antenna systems, derived based on the maximum a posteriori (MAP) criterion. We compared the performance of the ISIC scheme in [30] with that of the proposed 1-LAS algorithm in detecting 4 x 4, 8 x 8 and 16 x 16 ILL-only STBCs with Nt = Nr and 4-QAM. Figure 6 shows this performance comparison. In [30], a zero-forcing vector was used as the initial vector in the ISIC scheme. However, performance is better with MMSE initial

In [29], we have presented an analytical proof that the bit error performance of the 1-LAS detector for V-BLAST with 4-QAM in i.i.d. Rayleigh fading converges to that of the ML detector as Nt, Nr → ∞, keeping Nt = Nr.



vector. Since we used MMSE initial vector for 1-LAS, we have used MMSE initial vector for the ISIC algorithm as well. Also, in [30], 4 to 5 iterations were shown to be good enough for the ISIC algorithm to converge. In our simulations of the ISIC algorithm, we used 10 iterations. Two key observations can be made from Fig. 6: i) like the 1-LAS algorithm, the ISIC algorithm also shows large system behavior (i.e., improved BER for increasing Nt = Nr), and ii) the proposed 1-LAS
algorithm outperforms the ISIC algorithm by about 3 to 5 
dB at IG^'^ uncoded BER. In addition, the complexity of 
the ISIC scheme is higher than the proposed scheme (see the 
complexity comparison in Table I). 

Next, we compare the proposed large-MIMO architecture using STBCs from CDA and M-LAS detection with other large-MIMO architectures and associated detectors reported in the literature. Large-MIMO architectures that use stacking of multiple small-sized STBCs and interference cancellation (IC) detectors for these schemes have been investigated in [22], [31], [32]. Here, we compare different architecture/detector combinations, fixing the total number of transmit/receive antennas and spectral efficiency to be the same in all the considered combinations. Specifically, we fix Nt = Nr = 16 and a spectral efficiency of 32 bps/Hz for all the combinations. We compare the following seven different architecture/detector combinations which use the same Nt = Nr = 16 and achieve 32 bps/Hz spectral efficiency (see Table I): i) proposed scheme using 16 x 16 ILL-only STBC (rate-16) with 4-QAM and 1-LAS detection, ii) 16 x 16 ILL-only STBC (rate-16) with 4-QAM and ISIC algorithm in [30] with 10 iterations, iii) four 4 x 4 stacked QOSTBCs (rate-1) with 256-QAM and IC algorithm presented in [22], iv) eight 2 x 2 stacked Alamouti codes (rate-1) with 16-QAM and IC algorithm in [22], v) 16 x 16 V-BLAST scheme (rate-16) with 4-QAM and sphere decoder (SD), vi) 16 x 16 V-BLAST scheme (rate-16) with 4-QAM and ZF-SIC detector,

Fig. 7. Uncoded BER comparison between different large-MIMO architecture/detector combinations for given number of transmit/receive antennas (Nt = Nr = 16) and spectral efficiency (32 bps/Hz). Proposed scheme performs better than other architecture/detector combinations considered. It outperforms them in complexity as well (see Table I).

and vii) 16 x 16 V-BLAST scheme (rate-16) with 4-QAM and ISIC algorithm in [30]. We present the BER performance comparison of these different combinations in Fig. 7. We also obtained the complexity numbers (in number of real operations per bit) from simulations for these different combinations at an uncoded BER of 5 x 10^{-2}; these numbers are presented in Table I, along with the SNRs at which 5 x 10^{-2} uncoded BER is achieved. The following interesting observations can be made from Fig. 7 and Table I:

• the proposed scheme (combination i)) significantly outperforms the stacked architecture/IC detector combinations presented in [22] (combinations iii) and iv)); e.g., at 5 x 10^{-2} uncoded BER, the proposed scheme performs better than the stacked architecture/IC in [22] by 17 dB (for four 4 x 4 QOSTBCs) and 10 dB (for eight 2 x 2 Alamouti codes). Also, the proposed scheme achieves this significant performance advantage at a much lower complexity than those of the stacked architecture/IC combinations (see Table I).

• the proposed scheme performs slightly better than the V-BLAST/sphere decoder combination (combination v)); 6.8 dB in proposed scheme versus 7 dB in V-BLAST with sphere decoding at 5 x 10^{-2} uncoded BER. Importantly, the proposed scheme enjoys a significant complexity advantage (by more than an order) over the V-BLAST/sphere decoder combination.

• the ISIC algorithm in [30] applied to ILL-only STBC detection (combination ii)) is inferior to the proposed scheme in both performance (by about 4.5 dB at 5 x 10^{-2} uncoded BER) as well as complexity (by about two orders).

• the ISIC algorithm in [30] applied to 16 x 16 V-BLAST detection (combination vii)) is also inferior to the proposed scheme in BER performance (by about 3.8 dB at


Fig. 8. Turbo coded BER of 1-LAS detector for 16 x 16 FD-ILL and ILL-only STBCs. Nt = Nr = 16, 4-QAM, turbo code rates: 1/3, 1/2, 3/4 (10.6, 16, 24 bps/Hz). 1-LAS detector performs to within about 4 dB of capacity. ILL-only STBCs perform as well as FD-ILL STBCs.

5 x 10^{-2} uncoded BER) as well as complexity (by about a factor of 2).

• comparing the stacked architecture/IC combinations with V-BLAST/ZF-SIC (combination vi)) and V-BLAST/ISIC combinations, we see that although the diversity orders achieved in stacked architecture/IC combinations are high (see their slopes at high SNRs in Fig. 7), V-BLAST with ZF-SIC and ISIC detectors perform much better at low and medium SNRs.

In summary, the proposed scheme outperforms the other considered architecture/detector combinations both in terms of performance as well as complexity.
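As a quick arithmetic check that the compared combinations are matched in spectral efficiency, the following sketch multiplies the number of stacked codes, the rate of each code (complex symbols per channel use), and the bits per QAM symbol. The function name and the dictionary of cases are illustrative and not taken from the paper.

```python
from math import log2


def spectral_efficiency_bps_hz(num_stacked_codes, symbols_per_channel_use, qam_size):
    """Uncoded spectral efficiency: codes x (symbols per channel use) x (bits per symbol)."""
    return num_stacked_codes * symbols_per_channel_use * log2(qam_size)


# (number of stacked codes, rate per code, QAM alphabet size)
combinations = {
    "16x16 ILL-only STBC, rate-16, 4-QAM": (1, 16, 4),
    "four stacked 4x4 QOSTBCs, rate-1, 256-QAM": (4, 1, 256),
    "eight stacked 2x2 Alamouti codes, rate-1, 16-QAM": (8, 1, 16),
    "16x16 V-BLAST, rate-16, 4-QAM": (1, 16, 4),
}

for name, params in combinations.items():
    print(name, spectral_efficiency_bps_hz(*params), "bps/Hz")
```

Each case evaluates to 32 bps/Hz, which is why the stacked schemes need much larger alphabets (256-QAM, 16-QAM) than the rate-16 schemes to reach the same operating point.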

E. Turbo Coded BER and Nearness-to-Capacity Results 

Next, we evaluated the turbo coded BER performance of the proposed scheme. In all the coded BER simulations, we fed the soft outputs presented in Sec. III-E as input to the turbo decoder. In Fig. 8, we plot the turbo coded BER of the 1-LAS detector in decoding 16 x 16 FD-ILL and ILL-only STBCs, with Nt = Nr = 16, 4-QAM and turbo code rates 1/3 (10.6 bps/Hz), 1/2 (16 bps/Hz), 3/4 (24 bps/Hz). The minimum SNRs required to achieve these capacities in a 16 x 16 MIMO channel (obtained by evaluating the ergodic capacity expression in [1] through simulation) are also shown. It can be seen that the 1-LAS detector performs to within about 4 dB of capacity, which is very good in terms of nearness-to-capacity considering the high spectral efficiencies achieved. It can also be seen that the coded BER performance of FD-ILL and ILL-only STBCs are almost the same for the system parameters considered.

F. Effect of MIMO Spatial Correlation 

In generating the BER results in Figs. 2 to 8, we have assumed i.i.d. fading. However, MIMO propagation conditions
TABLE I
Complexity and performance comparison of different large-MIMO architecture/detector combinations, all with Nt = Nr = 16 and achieving 32 bps/Hz spectral efficiency. Proposed scheme outperforms the other considered architectures/detectors both in terms of performance as well as complexity.

No.  | Large-MIMO architecture/detector combination (fixed Nt = Nr = 16 and 32 bps/Hz for all combinations) | Complexity (in # real operations per bit) at 5 x 10^{-2} uncoded BER | SNR required to achieve 5 x 10^{-2} uncoded BER (from Fig. 7)
i)   | 16 x 16 ILL-only CDA STBC (rate-16), 4-QAM and 1-LAS detection [Proposed scheme] | 3.473 x 10^? | 6.8 dB
ii)  | 16 x 16 ILL-only CDA STBC (rate-16), 4-QAM and ISIC algorithm in [30] | 1.187 x 10^? | 11.3 dB
iii) | Four 4 x 4 stacked rate-1 QOSTBCs, 256-QAM and IC algorithm in [22] | 5.54 x 10^? | 24 dB
iv)  | Eight 2 x 2 stacked rate-1 Alamouti codes, 16-QAM and IC algorithm in [22] | 8.719 x 10^? | 17 dB
v)   | 16 x 16 V-BLAST (rate-16) scheme, 4-QAM and sphere decoding | 4.66 x 10^? | 7 dB
vi)  | 16 x 16 V-BLAST (rate-16) scheme, 4-QAM and V-BLAST detector (ZF-SIC) | 1.75 x 10^? | 13 dB
vii) | 16 x 16 V-BLAST (rate-16) scheme, 4-QAM and ISIC algorithm in [30] | 7.883 x 10^? | 10.6 dB



witnessed in practice often render the i.i.d. fading model inadequate. More realistic MIMO channel models that take into account the scattering environment, spatial correlation, etc., have been investigated in the literature [23], [33]. For example, spatial correlation at the transmit and/or receive side can affect the rank structure of the MIMO channel, resulting in degraded MIMO capacity [33]. The structure of scattering in the propagation environment can also affect the capacity [23]. Hence, it is of interest to investigate the performance of the M-LAS detector in more realistic MIMO channel models. To this end, we use the non-line-of-sight (NLOS) correlated MIMO channel model proposed by Gesbert et al. in [23], and evaluate the effect of spatial correlation on the BER performance of the M-LAS detector [34].

We consider the following parameters in the simulations: fc = 5 GHz, R = 500 m, S = 30, Dt = Dr = 20 m, θt = θr = 90°, and dt = dr = 2λ/3. For fc = 5 GHz, λ = 6 cm and dt = dr = 4 cm. In Fig. 7, we plot the BER performance of the 1-LAS detector in decoding 16 x 16 ILL-only STBC with Nt = Nr = 16 and 16-QAM. Uncoded BER as well as rate-3/4 turbo coded BER (48 bps/Hz spectral efficiency) for i.i.d. fading as well as correlated fading are shown. In addition,

Please see [23] for more elaborate details of the spatially correlated MIMO channel model. We note that this model can be appropriate in application scenarios like high data rate wireless IPTV/HDTV distribution using high spectral efficiency large-MIMO links, where large Nt and Nr can be placed at the base station (BS) and customer premises equipment (CPE), respectively.

The parameters used in the model in [23] include: Nt, Nr: number of transmit and receive (omni-directional) antennas; dt, dr: spacing between antenna elements at the transmit side and at the receive side; R: distance between transmitter and receiver; Dt, Dr: transmit and receive scattering radii; S: number of scatterers on each side; θt, θr: angular spread at the transmit and receive sides; and fc, λ: carrier frequency, wavelength.



from the MIMO capacity formula in [1], we evaluated the theoretical minimum SNRs required to achieve a capacity of 48 bps/Hz in i.i.d. as well as correlated fading, and plotted them also in Fig. 7. It is seen that the minimum SNR required to achieve a certain capacity (48 bps/Hz) gets increased for correlated fading compared to i.i.d. fading. From the BER plots in Fig. 7, it can be observed that at an uncoded BER
of 10~^, the performance in correlated fading degrades by 
about 7 dB compared to that in i.i.d. fading. Likewise, at a rate-3/4 turbo coded BER of 10^{-4}, a performance loss of about 6 dB is observed in correlated fading compared to that in i.i.d. fading. In terms of nearness to capacity, the vertical fall of the coded BER for i.i.d. fading occurs at about 24 dB SNR, which is about 13 dB away from the theoretical minimum required SNR of 11.1 dB. With correlated fading, the detector is observed to perform within about 18.5 dB of capacity. One way to alleviate such degradation in performance due to spatial correlation is to provide more dimensions at the receive side, which is highlighted in Fig. 9.

Figure 9 illustrates that the 1-LAS detector can achieve substantial improvement in uncoded as well as coded BER performance in decoding 12 x 12 ILL-only STBC by increasing Nr beyond Nt for 16-QAM in correlated fading. In the simulations, we have maintained Nr dr = 72 cm and dt = dr in both the cases of symmetry (i.e., Nt = Nr = 12) as well as asymmetry (i.e., Nt = 12, Nr = 18). By comparing the 1-LAS detector performance with [Nt = Nr = 12] versus [Nt = 12, Nr = 18], we observe that the uncoded BER performance with [Nt = 12, Nr = 18] improves by about 17
dB compared to that of [N = iV,- = 12] at 2 x 10"^ BER. 
Even the uncoded BER performance with [Nt = 12, Nr = 18] is significantly better than the coded BER performance with



12 ACCEPTED IN IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING: SPL. ISS. ON MANAGING COMPLEXITY IN MULTIUSER MIMO SYSTEMS 




Fig. 9. Effect of Nr > Nt in correlated MIMO fading in [23] keeping Nr dr constant and dt = dr. Nr dr = 72 cm, fc = 5 GHz, R = 500 m, S = 30, Dt = Dr = 20 m, θt = θr = 90°, 12 x 12 ILL-only STBC, Nt = 12, Nr = 12, 18, 16-QAM, rate-3/4 turbo code, 36 bps/Hz. Increasing the number of receive dimensions alleviates the loss due to spatial correlation.



[iVt = Nr = 12] by about 11.5 dB at 10"^ BER. This 
improvement is essentially due to the ability of the 1-LAS detector to effectively pick up the additional diversity orders provided by the increased number of receive antennas. With a rate-3/4 turbo code (i.e., 36 bps/Hz), at a coded BER of 10^{-4}, the 1-LAS detector achieves a significant performance improvement of about 13 dB with [Nt = 12, Nr = 18] compared to that with [Nt = Nr = 12]. With [Nt = 12, Nr = 18], the vertical fall of coded BER is such that it is only about 8 dB from the theoretical minimum SNR needed to achieve capacity. This points to the potential for realizing high spectral efficiency multi-gigabit large-MIMO systems that can achieve good performance even in the presence of spatial correlation. We further remark that transmit correlation in MIMO fading can be exploited by using non-isotropic inputs (precoding) based on the knowledge of the channel correlation matrices [35]-[37]. While [35]-[37] propose precoders in conjunction with orthogonal/quasi-orthogonal small MIMO systems in correlated Rayleigh/Ricean fading, design of precoders for large-MIMO systems can be investigated as future work.
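The antenna-spacing figures quoted in this subsection follow from simple arithmetic, sketched below. The rounded speed of light and the helper names are assumptions made for this illustration. The sketch reproduces the 6 cm wavelength at 5 GHz, the 4 cm spacing for dt = dr = 2λ/3, and the receive spacings implied by holding Nr dr = 72 cm for Nr = 12 and Nr = 18.

```python
SPEED_OF_LIGHT_M_PER_S = 3.0e8  # rounded value, assumed for this illustration


def wavelength_cm(carrier_hz):
    """Carrier wavelength in centimetres."""
    return 100.0 * SPEED_OF_LIGHT_M_PER_S / carrier_hz


def spacing_for_fixed_array_length_cm(array_length_cm, num_antennas):
    """Element spacing when the product (number of antennas) x (spacing) is held fixed."""
    return array_length_cm / num_antennas


lam_cm = wavelength_cm(5.0e9)          # about 6 cm at 5 GHz
spacing_cm = 2.0 * lam_cm / 3.0        # about 4 cm for d = 2*lambda/3
print(lam_cm, spacing_cm)
print(spacing_for_fixed_array_length_cm(72.0, 12))  # 6 cm per element for Nr = 12
print(spacing_for_fixed_array_length_cm(72.0, 18))  # 4 cm per element for Nr = 18
```

Under this constraint, going from 12 to 18 receive antennas shrinks the element spacing from one wavelength to two-thirds of a wavelength while the overall array length stays the same.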

V. Iterative Detection/Channel Estimation 

In this section, we relax the perfect CSIR assumption made in the previous section, and estimate the channel matrix based on a training-based iterative detection/channel estimation scheme [38]. Training-based schemes, where a pilot signal known to the transmitter and the receiver is sent to get a rough estimate of the channel (training phase), have been studied for STBC MIMO systems in [39]-[42]. Here, we adopt a training-based approach for channel estimation in large STBC MIMO systems. In the considered training-based channel estimation scheme, transmission is carried out in frames, where one $N_t \times N_t$ pilot matrix, $\mathbf{X}_c^{(\mathrm{P})} \in \mathbb{C}^{N_t \times N_t}$, for training purposes, followed by $N_d$ data STBC matrices, $\mathbf{X}_c^{(i)} \in \mathbb{C}^{N_t \times N_t}$, $i = 1, 2, \cdots, N_d$, are sent in each frame as shown in Fig. 10. One frame length, $T$, (taken to be the




Fig. 10. Transmission scheme with one pilot matrix followed by $N_d$ data STBC matrices in each frame.



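channel coherence time) is $T = (N_d + 1)N_t$ channel uses. A frame of transmitted pilot and data matrices is of dimension $N_t \times N_t(1 + N_d)$, which can be written as

$$\mathcal{X}_c = \left[\mathbf{X}_c^{(\mathrm{P})}\ \ \mathbf{X}_c^{(1)}\ \ \mathbf{X}_c^{(2)}\ \cdots\ \mathbf{X}_c^{(N_d)}\right]. \qquad (41)$$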
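The frame structure fixes how much of each coherence interval is spent on pilots. The sketch below computes the frame length $T = (N_d + 1)N_t$ and the fraction of channel uses carrying data, and applies that fraction to a nominal spectral efficiency. The helper names and the nominal 24 bps/Hz figure used in the example are illustrative choices, not a statement of the paper's simulation setup.

```python
def frame_length(num_tx_antennas, num_data_matrices):
    """Frame length T = (N_d + 1) * N_t in channel uses (one pilot matrix plus N_d data matrices)."""
    return (num_data_matrices + 1) * num_tx_antennas


def data_fraction(num_data_matrices):
    """Fraction of the frame occupied by data matrices: N_d / (N_d + 1)."""
    return num_data_matrices / (num_data_matrices + 1)


def net_spectral_efficiency(nominal_bps_hz, num_data_matrices):
    """Nominal spectral efficiency scaled by the data fraction of the frame."""
    return nominal_bps_hz * data_fraction(num_data_matrices)


for n_d in (1, 8):
    print(
        "N_d =", n_d,
        "T =", frame_length(16, n_d),
        "net rate =", round(net_spectral_efficiency(24.0, n_d), 2), "bps/Hz",
    )
```

With $N_t = 16$, one data matrix per pilot halves the net rate, while eight data matrices per pilot keep about 89% of it, at the cost of requiring the channel to stay constant over a longer frame.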



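As in [43], let $\gamma_p$ and $\gamma_d$ denote the average SNR during pilot and data phases, respectively, which are related to the average received SNR $\gamma$ as $\gamma(N_d + 1) = \gamma_p + N_d\gamma_d$. Define $\beta_p = \gamma_p/\gamma$ and $\beta_d = \gamma_d/\gamma$. Let $E_s$ denote the average energy of the transmitted symbol during the data phase. The average received signal power during the data phase is given by $\mathbb{E}\big[\mathrm{tr}\big(\mathbf{X}_c^{(i)}(\mathbf{X}_c^{(i)})^H\big)\big] = N_t^2 E_s$, and the average received signal power during the pilot phase is $\mathbb{E}\big[\mathrm{tr}\big(\mathbf{X}_c^{(\mathrm{P})}(\mathbf{X}_c^{(\mathrm{P})})^H\big)\big] = \mu N_t$, where $\mu = \frac{\beta_p N_t E_s}{\beta_d}$. For optimal training, the pilot matrix should be such that $\mathbf{X}_c^{(\mathrm{P})}(\mathbf{X}_c^{(\mathrm{P})})^H = \mu\mathbf{I}_{N_t}$ [43].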
As in Sec. II, let $\mathbf{H}_c \in \mathbb{C}^{N_r \times N_t}$ denote the channel matrix, which we want to estimate. We assume block fading, where the channel gains remain constant over one frame consisting of $(1 + N_d)N_t$ channel uses, which can be viewed as the channel coherence time. This assumption can be valid in slow fading fixed wireless applications (e.g., as in possible applications like BS-to-BS backbone connectivity and BS-to-CPE wireless IPTV/HDTV distribution). For this training-based system and channel model, Hassibi and Hochwald presented a lower bound on the capacity in [43]; we will illustrate the nearness of the performance achieved by the proposed iterative detection/estimation scheme to this bound. The received frame is of dimension $N_r \times N_t(1 + N_d)$, and can be written as



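$$\mathcal{Y}_c = \left[\mathbf{Y}_c^{(\mathrm{P})}\ \ \mathbf{Y}_c^{(1)}\ \ \mathbf{Y}_c^{(2)}\ \cdots\ \mathbf{Y}_c^{(N_d)}\right] = \mathbf{H}_c\mathcal{X}_c + \mathcal{N}_c, \qquad (42)$$

where $\mathcal{N}_c = \left[\mathbf{N}_c^{(\mathrm{P})}\ \ \mathbf{N}_c^{(1)}\ \ \mathbf{N}_c^{(2)}\ \cdots\ \mathbf{N}_c^{(N_d)}\right]$ is the $N_r \times N_t(1 + N_d)$ noise matrix and its entries are modeled as i.i.d. $\mathcal{CN}\big(0, \sigma^2 = \frac{N_t E_s}{\beta_d\gamma}\big)$. Equation (42) can be decomposed into two parts, namely, the pilot matrix part and the data matrices part, as

$$\mathbf{Y}_c^{(\mathrm{P})} = \mathbf{H}_c\mathbf{X}_c^{(\mathrm{P})} + \mathbf{N}_c^{(\mathrm{P})}, \qquad (43)$$

$$\left[\mathbf{Y}_c^{(1)}\ \ \mathbf{Y}_c^{(2)}\ \cdots\ \mathbf{Y}_c^{(N_d)}\right] = \mathbf{H}_c\left[\mathbf{X}_c^{(1)}\ \ \mathbf{X}_c^{(2)}\ \cdots\ \mathbf{X}_c^{(N_d)}\right] + \left[\mathbf{N}_c^{(1)}\ \ \mathbf{N}_c^{(2)}\ \cdots\ \mathbf{N}_c^{(N_d)}\right]. \qquad (44)$$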
A. MMSE Estimation Scheme

A straightforward way to achieve detection of data symbols with estimated channel coefficients is as follows:

1) Estimate the channel gains via an MMSE estimator from the signal received during the first $N_t$ channel uses (i.e., during pilot transmission); i.e., given $\mathbf{Y}_c^{(\mathrm{P})}$ and $\mathbf{X}_c^{(\mathrm{P})}$, an estimate of the channel matrix $\mathbf{H}_c$ is found as

$$\mathbf{H}_c^{est} = \mathbf{Y}_c^{(\mathrm{P})}(\mathbf{X}_c^{(\mathrm{P})})^H\left[\sigma^2\mathbf{I}_{N_t} + \mathbf{X}_c^{(\mathrm{P})}(\mathbf{X}_c^{(\mathrm{P})})^H\right]^{-1}. \qquad (45)$$

2) Use the above $\mathbf{H}_c^{est}$ in place of $\mathbf{H}_c$ in the LAS algorithm (as described in Sec. III) and detect the transmitted data symbols.

We refer to the above scheme as the 'MMSE estimation scheme.' In the absence of the knowledge of $\sigma^2$, a zero-forcing estimate can be obtained at the cost of some performance loss compared to the MMSE estimate. The performance of the estimator can be improved by using a cyclic minimization technique for minimizing the ML metric [44].
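The estimator in (45) can be sketched in a few lines of dependency-free Python, as below. This is an illustrative sketch rather than the authors' code: matrices are lists of rows of complex numbers, the helper names are hypothetical, and the inverse is computed by plain Gauss-Jordan elimination, which is adequate for a small example but not tuned for large $N_t$.

```python
def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[A[i][j].conjugate() for i in range(len(A))] for j in range(len(A[0]))]


def matmul(A, B):
    """Matrix product A B."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def inverse(A):
    """Inverse of a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [
        [complex(v) for v in row] + [1 + 0j if i == j else 0j for j in range(n)]
        for i, row in enumerate(A)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def mmse_channel_estimate(Y, X, sigma2):
    """H_est = Y X^H (sigma2 * I + X X^H)^{-1}, the form used in (45)."""
    XH = hermitian(X)
    gram = matmul(X, XH)
    n = len(gram)
    regularized = [
        [gram[i][j] + (sigma2 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    return matmul(matmul(Y, XH), inverse(regularized))


# Noise-free toy example with a pilot satisfying X X^H = 2 I.
H_true = [[1 + 1j, 0.5], [-0.3j, 2.0]]
X_pilot = [[1, 1], [1, -1]]
Y_pilot = matmul(H_true, X_pilot)
print(mmse_channel_estimate(Y_pilot, X_pilot, sigma2=2.0))
```

With a pilot that satisfies $\mathbf{X}\mathbf{X}^H = \mu\mathbf{I}$, the bracketed matrix is a scaled identity, so in the noise-free case the estimate equals $\mathbf{H}_c$ scaled by $\mu/(\mu + \sigma^2)$. In the example this is a factor of one half.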

B. Proposed Iterative Detection/Estimation Scheme

Techniques that employ iterations between channel estimation and detection can offer improved performance. Iterative receiver algorithms are attractive to achieve a good tradeoff between performance and complexity [45]-[47]. In [45]-[47], receivers that iterate between channel estimation, multiuser detection and channel decoding in coded CDMA systems are presented. Similar iterative techniques in the context of MIMO and MIMO-OFDM systems are presented in [48]-[51]. Here, we propose an iterative scheme, where we iterate between channel estimation and detection in the considered large STBC MIMO system. The proposed scheme works as follows:

1) Obtain an initial estimate of the channel matrix using the MMSE estimator in (45) from the pilot part.

2) Using the estimated channel matrix, detect the data STBC matrices $\mathbf{X}_c^{(i)}$, $i = 1, 2, \cdots, N_d$ using the LAS detector. Substituting these detected STBC matrices into (41), form $\mathcal{X}_c^{est}$.

3) Re-estimate the channel matrix using $\mathcal{X}_c^{est}$ from the previous step, via

$$\mathbf{H}_c^{est} = \mathcal{Y}_c(\mathcal{X}_c^{est})^H\left[\sigma^2\mathbf{I}_{N_t} + \mathcal{X}_c^{est}(\mathcal{X}_c^{est})^H\right]^{-1}. \qquad (46)$$

4) Iterate steps 2 and 3 for a specified number of iterations.
The total complexity of obtaining the MMSE estimate of the 
channel matrix Hf * in (gS]) and ^ is 0{N'^Nr) + 0[Nf ), 
which is less than the total complexity of 1-LAS detection of 
0[Nf log Nt) for ILL-only STBCs. 
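The control flow of the four steps can be illustrated with a deliberately small stand-in. The sketch below is a toy under stated assumptions and not the proposed receiver: it uses a real scalar channel instead of $\mathbf{H}_c$, BPSK symbols instead of STBC matrices, and a nearest-symbol slicer in place of the LAS detector. All function names are hypothetical. Only the alternation between estimation and detection mirrors the scheme.

```python
def mmse_estimate(received, symbols, sigma2):
    """Scalar analogue of (45)/(46): h = sum(y * x) / (sigma2 + sum(x * x))."""
    numerator = sum(y * x for y, x in zip(received, symbols))
    denominator = sigma2 + sum(x * x for x in symbols)
    return numerator / denominator


def detect(received, h_est):
    """Stand-in for the LAS detector: nearest BPSK symbol given the channel estimate."""
    return [1.0 if y * h_est >= 0 else -1.0 for y in received]


def iterative_detection_estimation(pilot_rx, pilot_tx, data_rx, sigma2, num_iterations):
    # Step 1: initial estimate from the pilot part only.
    h_est = mmse_estimate(pilot_rx, pilot_tx, sigma2)
    detected = []
    for _ in range(num_iterations):  # Step 4: repeat steps 2 and 3.
        # Step 2: detect data with the current estimate and rebuild the frame.
        detected = detect(data_rx, h_est)
        frame_tx = list(pilot_tx) + detected
        frame_rx = list(pilot_rx) + list(data_rx)
        # Step 3: re-estimate the channel from the whole frame.
        h_est = mmse_estimate(frame_rx, frame_tx, sigma2)
    return h_est, detected


# Noise-free toy frame: 2 pilot symbols followed by 6 data symbols, true channel 0.8.
h_true = 0.8
pilot_tx = [1.0, 1.0]
data_tx = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]
pilot_rx = [h_true * x for x in pilot_tx]
data_rx = [h_true * x for x in data_tx]
h_pilot_only = mmse_estimate(pilot_rx, pilot_tx, sigma2=0.1)
h_iter, detected = iterative_detection_estimation(pilot_rx, pilot_tx, data_rx, 0.1, 2)
print(h_pilot_only, h_iter, detected)
```

In this toy case the re-estimate uses eight known-or-detected symbols instead of two, so the bias of the regularized estimate shrinks and the result moves closer to the true channel value. This is the mechanism the iterative scheme relies on when the detected data are mostly correct.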

C. BER Performance with Estimated CSIR 

We evaluated the BER performance of the 1-LAS detector using estimated CSIR, where we estimate the channel gain matrix through the training-based estimation schemes described in the previous two subsections. We consider the BER performance under three scenarios, namely, i) under perfect CSIR, ii) under CSIR estimated using the MMSE estimation scheme in Sec. V-A, and iii) under CSIR estimated using the



- Perfect CSIR 

- 1P + 8D (H-H bound) 

- IP + ID (H-H bound) 



lex 16 MIMO Channel 



24 bps/Hz 



. 21^3 bp_s/Hz j< ' 



J2_bES/Hz 



4 6 8 

Average SNR (dB) 



Fig. 1 1 . Hassibi-Hochwald (H-H) capacity bound for 1P-I-8D (T = 144, r = 
16, 13p = I3d = 1) and IP-l-lD (T = 32, t = 16, jSp = 13^ = 1) training for 
a 16 X 16 MIMO channel Perfect CSIR capacity is also shown. 



iterative detection/estimation scheme in Sec. IV-BI In the case 
of estimated CSIR, we show plots for IP+NdD training, where 
by IP-niVrfD training we mean a training scheme with a frame 
size of 1 + TVrf matrices, with 1 pilot matrix followed data 
STBC matrices from CDA. For this IP+N^D training scheme, 
a lower bound on the capacity is given by |[43l 



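$$C \geq \frac{T - \tau}{T}\, E\left[\log\det\left(\mathbf{I}_{N_r} + \frac{\gamma^2 \beta_d \beta_p \tau}{N_t (1 + \gamma \beta_d) + \gamma \beta_p \tau}\, \frac{\hat{\mathbf{H}}_c \hat{\mathbf{H}}_c^H}{N_t\, \sigma^2_{\hat{H}_c}}\right)\right], \quad (47)$$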



where $T$ and $\tau$, respectively, are the frame size (i.e., channel
coherence time) and pilot duration in number of channel
uses, and $\sigma^2_{\hat{H}_c} = \frac{1}{N_t N_r} E\left[\mathrm{tr}\{\hat{\mathbf{H}}_c \hat{\mathbf{H}}_c^H\}\right]$, where
$\hat{\mathbf{H}}_c = E[\mathbf{H}_c \mid \mathbf{X}^{(P)}, \mathbf{Y}^{(P)}]$ is the MMSE estimate of the channel
gain matrix. We computed the capacity bound in (47) through
simulations for 1P+8D and 1P+1D training for a 16 × 16
MIMO channel. For 1P+8D training, $T = (1 + 8)16 = 144$ and
$\tau = 16$; for 1P+1D training, $T = (1 + 1)16 = 32$ and $\tau = 16$.
In computing the bounds (shown in Fig. 11) and in BER
simulations (in Figs. 12 and 13), we have used $\beta_p = \beta_d = 1$. In
Fig. 11, we plot the computed capacity bounds, along with the
capacity under perfect CSIR [1]. We obtain the minimum SNR
for a given capacity bound in (47) from the plots in Fig. 11
and show (later in Fig. 13) the nearness of the coded BER of
the proposed scheme to this SNR limit. We note that improved
capacity and BER performance can be achieved if the optimum
pilot/data power allocation derived in [43] is used instead of
the allocation used in Figs. 11 to 13 (i.e., $\beta_p = \beta_d = 1$). We
have used the optimum power allocation in [43] for generating
the BER plots in Figs. 14 and 15. In all the BER simulations
with training, $\sqrt{\mu}\,\mathbf{I}_{N_t}$ is used as the pilot matrix. ILL-only
STBCs and 1-LAS detection are used.

First, in Fig. 12, we plot the uncoded BER performance
of the 1-LAS detector when 1P+1D and 1P+8D training are used
for channel estimation in a 16 × 16 STBC MIMO system
with $N_t = N_r = 16$ and 4-QAM. BER performance with
perfect CSIR is also plotted for comparison. From Fig. 12,
it can be observed that, as expected, the BER degrades with
estimated CSIR compared to that with perfect CSIR. With the
MMSE estimation scheme, the performance with 1P+1D and
1P+8D training is the same because of the one-shot estimation.
Also, with 1P+1D training, both the MMSE estimation scheme and
the iterative detection/estimation scheme (with 4 iterations
between detection and estimation) perform almost the same,
which is about 3 dB worse than with perfect CSIR
at an uncoded BER of $10^{-3}$. This indicates that with 1P+$N_d$D
training, iteration between detection and estimation does not
improve performance much over the non-iterative scheme (i.e.,
the MMSE estimation scheme) for small $N_d$. With large $N_d$
(e.g., slow fading), however, the iterative scheme outperforms
the non-iterative scheme; e.g., with 1P+8D training, the
performance of the iterative detection/estimation scheme improves
by about 1 dB compared to the MMSE estimation.

Fig. 12. Uncoded BER of 1-LAS detector for 16 × 16 ILL-only STBC
with i) perfect CSIR, ii) CSIR using MMSE estimation scheme, and iii)
CSIR using iterative detection/channel estimation scheme (4 iterations).
$N_t = N_r = 16$, 4-QAM, 1P+1D ($T = 32$, $\tau = 16$, $\beta_p = \beta_d = 1$) and 1P+8D
($T = 144$, $\tau = 16$, $\beta_p = \beta_d = 1$) training.

Next, in Fig. 13, we present the rate-3/4 turbo coded BER
of the 1-LAS detector using estimated CSIR for the cases of
1P+8D and 1P+1D training. From Fig. 13, it can be seen
that, compared to that of perfect CSIR, the estimated CSIR
performance is worse by about 3 dB in terms of coded BER
for 1P+8D training. With the MMSE estimation scheme, $10^{-3}$
coded BER occurs at about 12 - 7.7 = 4.3 dB away from the
capacity bound for 1P+1D and 1P+8D training. This nearness
to the capacity bound improves by about 0.6 dB for the iterative
detection/estimation scheme. We note that for the system in
Fig. 13 with parameters 16 × 16 STBC, 4-QAM, rate-3/4 turbo
code, and 1P+8D training with $T = 144$, $\tau = 16$, we achieve a
high spectral efficiency of $16 \times 2 \times \frac{3}{4} \times \frac{8}{9} = 21.3$ bps/Hz even
after accounting for the overheads involved in channel
estimation (i.e., pilot matrix) and channel coding, while achieving
good near-capacity performance at low complexity. This points
to the suitability of the proposed approach of using LAS
detection along with iterative detection/estimation in practical
implementation of large STBC MIMO systems.

Finally, in Fig. 14, we illustrate the coded BER performance
of 1-LAS detection and the iterative detection/estimation scheme
for different coherence times, $T$, for a fixed $N_t = N_r = 16$,
16 × 16 STBC, 4-QAM, and rate-3/4 turbo code. The various
values of $T$ considered and the corresponding spectral
efficiencies are: i) $T = 32$, 1P+1D, 12 bps/Hz, ii) $T = 144$,
1P+8D, 21.3 bps/Hz, iii) $T = 400$, 1P+24D, 23.1 bps/Hz,
and iv) $T = 784$, 1P+48D, 23.5 bps/Hz. In all these cases,
the corresponding optimum pilot/data power allocations in [43]
are used. From Fig. 14, it can be seen that for these four cases,
$10^{-3}$ coded BER occurs at around 12 dB, 10.6 dB, 9.7 dB, and
9.4 dB, respectively. The $10^{-3}$ coded BER for perfect CSIR
happens at around 8.5 dB. This indicates that the performance
with estimated CSIR improves as $T$ is increased, and that
a performance loss of less than 1 dB compared to perfect
CSIR can be achieved with large $T$ (i.e., slow fading). For
example, with 1P+48D training ($T = 784$), the performance
with estimated CSIR gets close to that with perfect CSIR both
in terms of spectral efficiency (23.5 vs 24 bps/Hz) and
SNR at which $10^{-3}$ coded BER occurs (9.4 vs 8.5 dB). This
is expected, since the channel estimation becomes increasingly
accurate in slow fading (large coherence times) while incurring
only a small loss in spectral efficiency due to pilot matrix
overhead. This result is significant because $T$ is typically large
in fixed/low-mobility wireless applications, and the proposed
system can effectively achieve high spectral efficiencies as
well as good performance in such applications.

Fig. 13. Turbo coded BER performance of 1-LAS detector for 16 × 16
ILL-only STBC with i) perfect CSIR, ii) CSIR using MMSE estimation, and iii)
CSIR using iterative detection/channel estimation (4 iterations).
$N_t = N_r = 16$, 4-QAM, rate-3/4 turbo code, 1P+1D ($T = 32$, $\tau = 16$,
$\beta_p = \beta_d = 1$) and 1P+8D ($T = 144$, $\tau = 16$, $\beta_p = \beta_d = 1$) training.
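The frame sizes and spectral efficiencies quoted for the four coherence times can be recomputed with a short script. This is a minimal sketch with hypothetical function names. It assumes a full-rate $N_t \times N_t$ STBC carrying $N_t$ symbols per channel use and a 1P+$N_d$D frame in which one block out of $1 + N_d$ is pilot. It reproduces the quoted values to within about 0.1 bps/Hz.

```python
from fractions import Fraction

def frame_size(n_t, n_d):
    """Channel uses in a 1P+N_dD frame: one pilot block plus n_d data
    blocks, each n_t channel uses long."""
    return (1 + n_d) * n_t

def spectral_efficiency(n_t, bits_per_symbol, code_rate, n_d):
    """bps/Hz after channel coding and pilot overhead, assuming n_t
    symbols per channel use (full-rate n_t x n_t STBC)."""
    return n_t * bits_per_symbol * code_rate * Fraction(n_d, 1 + n_d)

# 16 x 16 STBC, 4-QAM (2 bits/symbol), rate-3/4 turbo code.
for n_d in (1, 8, 24, 48):
    T = frame_size(16, n_d)
    eta = spectral_efficiency(16, 2, Fraction(3, 4), n_d)
    print(f"1P+{n_d}D: T = {T}, efficiency = {float(eta):.2f} bps/Hz")
```

Exact fractions are used so that the pilot overhead factor $N_d/(1 + N_d)$ is not blurred by rounding; as $N_d$ grows the efficiency approaches the 24 bps/Hz of the perfect-CSIR system.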

D. On Optimum Nt for a Given Nr and T 

In [43], through theoretical capacity bounds, it has been
shown that, for a given $N_r$, $T$ and SNR, there is an optimum
value of $N_t$ that maximizes the capacity bound (refer to Figs. 5
and 6 in [43], where the optimum $N_t$ is shown to be greater
than $N_r$ in Fig. 5 and less than $N_r$ in Fig. 6). For example,
for $N_r = 16$, $T = 48$, and SNR = 10 dB, the capacity
bound evaluated using (47) with optimum power allocation
for $N_t = 12$ is 19.73 bps/Hz, whereas for $N_t = 16$ the
capacity bound reduces to 17.53 bps/Hz, showing that the
optimum $N_t$ in this case will be less than $N_r$. We demonstrate
such an observation in practical systems by comparing the
simulated coded BER performance of two systems, referred to
as System-I and System-II, using 1-LAS detection and the iterative
detection/estimation scheme. The parameters of System-I and
System-II are listed in Table II. $N_r$ and $T$ are fixed at 16 and
48, respectively, in both systems. System-I uses 16 transmit
antennas and a 16 × 16 STBC, whereas System-II uses 12 transmit
antennas and a 12 × 12 STBC. Since the pilot matrix is $\sqrt{\mu}\,\mathbf{I}_{N_t}$,
the pilot duration $\tau$ is 16 and 12, respectively, for System-I
and System-II. Optimum pilot/data power allocation and
4-QAM modulation are employed in both systems. System-I
uses a rate-1/2 turbo code and System-II uses a rate-3/4 turbo
code. With the above system parameters, the spectral efficiency
achieved in System-I is $16 \times 2 \times \frac{1}{2} \times \frac{2}{3} \approx 10.67$ bps/Hz,
whereas System-II achieves a higher spectral efficiency of
$12 \times 2 \times \frac{3}{4} \times \frac{3}{4} = 13.5$ bps/Hz. In Fig. 15, we plot the
coded BER of both these systems using 1-LAS detection
and iterative detection/estimation. From the simulation points
shown in Fig. 15, it can be observed that System-II, with a
smaller $N_t$ and higher spectral efficiency, in fact achieves a
given coded BER at a lower SNR than
System-I. For example, to achieve $10^{-3}$ coded BER, System-I
requires an SNR of about 8.9 dB, whereas System-II requires
only 8.6 dB. This implies that because of the reduction of
throughput due to pilot symbols (by a factor of $(T - \tau)/T$ for
a given $T$ and $\tau = N_t$), a larger $N_t$ does not necessarily
mean a higher spectral efficiency. Such an observation has
also been made in [43] based on theoretical capacity bounds.
The proposed detection/channel estimation scheme allows the
prediction of such behavior through simulations, which, in
turn, allows system designers to find the optimum $N_t$ and STBC
size to achieve a certain spectral efficiency in large STBC
MIMO systems.

Fig. 14. Turbo coded BER performance of 1-LAS detection and iterative
estimation/detection as a function of coherence time, $T = 32, 144, 400, 784$,
for a given $N_t = N_r = 16$, 16 × 16 ILL-only STBC, 4-QAM, rate-3/4
turbo code. Spectral efficiency and BER performance with estimated CSIR
approach those with perfect CSIR in slow fading (i.e., large $T$).

Fig. 15. Comparison between two 1P+$N_d$D training-based systems, one with
a larger $N_t$ than the other for a given $N_r$ and $T$. With $N_r = 16$, $T = 48$ and
optimum power allocation in both systems, System-II with $N_t = 12$ achieves
a higher spectral efficiency (13.5 vs 10.67 bps/Hz) while achieving $10^{-3}$
coded BER at a lesser SNR (8.6 vs 8.9 dB) than System-I with $N_t = 16$.

TABLE II
On optimum $N_t$ for a given $N_r$ and $T$. System-II with a smaller
$N_t$ achieves a higher spectral efficiency while achieving $10^{-3}$
coded BER at a lesser SNR than System-I with a larger $N_t$.

Parameters                    System-I        System-II
# Rx antennas, $N_r$          16              16
Coherence time, $T$           48              48
# Tx antennas, $N_t$          16              12
STBC from CDA                 16 × 16         12 × 12
Pilot duration, $\tau$        16              12
Training                      1P+2D           1P+3D
$\beta_p$                     1.2426          1.4641
$\beta_d$                     0.8786          0.8453
Modulation                    4-QAM           4-QAM
Turbo code rate               1/2             3/4
Spectral efficiency           10.67 bps/Hz    13.5 bps/Hz
SNR at $10^{-3}$ coded BER    8.9 dB          8.6 dB
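The System-I/System-II comparison rests on two pieces of arithmetic that are easy to check: the pilot overhead factor $(T - \tau)/T$ and the pilot/data power split. The sketch below is illustrative only. It takes the two power-allocation values listed per system in Table II to be $\beta_p$ and $\beta_d$ (an interpretation on our part), and it assumes that an allocation preserving total frame energy satisfies $\beta_p \tau + \beta_d (T - \tau) = T$. Function and variable names are hypothetical.

```python
from fractions import Fraction

BITS_PER_SYMBOL = 2  # 4-QAM

# Parameters read from Table II; beta_p/beta_d labels are an assumption.
systems = {
    "System-I":  dict(n_t=16, T=48, code_rate=Fraction(1, 2),
                      beta_p=1.2426, beta_d=0.8786),
    "System-II": dict(n_t=12, T=48, code_rate=Fraction(3, 4),
                      beta_p=1.4641, beta_d=0.8453),
}

def pilot_overhead_factor(T, tau):
    """Fraction of the frame that carries data."""
    return Fraction(T - tau, T)

def efficiency(n_t, T, code_rate):
    """bps/Hz with pilot duration tau = n_t and n_t symbols per channel use."""
    return n_t * BITS_PER_SYMBOL * code_rate * pilot_overhead_factor(T, n_t)

def average_power(beta_p, beta_d, T, tau):
    """Frame-averaged power scaling; 1.0 means the same total energy
    as the equal allocation beta_p = beta_d = 1."""
    return (beta_p * tau + beta_d * (T - tau)) / T

for name, s in systems.items():
    eta = efficiency(s["n_t"], s["T"], s["code_rate"])
    avg = average_power(s["beta_p"], s["beta_d"], s["T"], s["n_t"])
    print(f"{name}: {float(eta):.2f} bps/Hz, average power scaling {avg:.4f}")
```

Under these assumptions the System-I product $16 \times 2 \times \frac{1}{2} \times \frac{2}{3}$ evaluates to $32/3 \approx 10.67$ bps/Hz and the System-II product to 13.5 bps/Hz, and both tabulated power pairs keep the frame-averaged power within about $10^{-4}$ of unity.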

VI. Conclusion 

We presented a low-complexity algorithm for the detection 
of high-rate, non-orthogonal STBC large-MIMO systems with 
tens of antennas that achieve high spectral efficiencies of the 
order of several tens of bps/Hz. We also presented a training- 
based iterative detection/channel estimation scheme for such 
large STBC MIMO systems. Our simulation results showed 
that the proposed 1-LAS detector along with the proposed 
iterative detection/channel estimation scheme achieved very 
good performance at low complexities. With the feasibil- 
ity of low-complexity high-performance receivers, like the 
proposed detection/channel estimation scheme, large-MIMO 
systems with tens of antennas at high spectral efficiencies can
become practical, enabling interesting high data rate wireless
applications (e.g., wireless IPTV/HDTV distribution). This
can motivate the inclusion of large-MIMO architectures (e.g.,
12 × 12, 16 × 16 MIMO systems, including those using STBCs
from CDA) into wireless standards like IEEE 802.11n/VHT
and IEEE 802.16/LTE-A in their evolution to achieve high
data rates at increased spectral efficiencies.

Appendix 

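Theorem 1: The $l_{p,opt}^{(k)}$ given in the main text minimizes
$\mathcal{F}(l_p^{(k)})$ in (19), and this minimum value is non-positive.

Proof: Let $r = \left\lfloor \frac{|z_p^{(k)}|}{2a_p} \right\rfloor$. Then $\frac{|z_p^{(k)}|}{2a_p} = r + f$, where $0 \leq f < 1$,
and so we can write

$$\frac{|z_p^{(k)}|}{a_p} = 2r + 2f. \quad (48)$$

If $l_p^{(k)}$ were unconstrained to be any real number, then the
optimal value of $l_p^{(k)}$ would be $\frac{|z_p^{(k)}|}{a_p}$, which would lie between
$2r$ and $2r + 2$ (as per (48)). Since $\mathcal{F}$ is quadratic in $l_p^{(k)}$,
it is unimodal, and hence the optimal point (with $l_p^{(k)}$
constrained) would be either $2r$ or $2r + 2$. Using (19) and (48),
we can evaluate $\mathcal{F}(2r + 2) - \mathcal{F}(2r)$ to be

$$\mathcal{F}(2r + 2) - \mathcal{F}(2r) = 4a_p(1 - 2f). \quad (49)$$

From (49), if $f \leq 0.5$, then $\mathcal{F}(2r + 2) \geq \mathcal{F}(2r)$ and $2r$ is optimal.
Since $2r$ never exceeds $\frac{2|z_p^{(k)}|}{a_p}$, $\mathcal{F}(2r) = \mathcal{F}(l_{p,opt}^{(k)})$ is
non-positive. Similarly, if
$f > 0.5$, then $2r + 2$ is optimal, and $\mathcal{F}(2r + 2) < \mathcal{F}(2r)$.
However, since $2r$ never exceeds $\frac{2|z_p^{(k)}|}{a_p}$, $\mathcal{F}(2r)$ is
non-positive and therefore $\mathcal{F}(2r + 2) = \mathcal{F}(l_{p,opt}^{(k)})$ is non-positive.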
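The argument of the proof can be checked numerically. The sketch below is a hedged illustration, not part of the proof. It assumes the cost function has the form $\mathcal{F}(l) = l^2 a_p - 2 l\,|z_p^{(k)}|$, which is consistent with (48) and (49) since it gives $\mathcal{F}(2r+2) - \mathcal{F}(2r) = a_p(8r + 4) - 4|z_p^{(k)}| = 4a_p(1 - 2f)$. It also assumes the constraint set is the even non-negative integers. Function names are hypothetical.

```python
import math

def F(l, a_p, z_abs):
    """Assumed cost function: F(l) = l^2 * a_p - 2 * l * |z_p|."""
    return l * l * a_p - 2 * l * z_abs

def l_opt(a_p, z_abs):
    """Constrained minimizer from the proof: 2r if f <= 0.5, else 2r + 2."""
    ratio = z_abs / (2 * a_p)
    r = math.floor(ratio)
    f = ratio - r
    return 2 * r if f <= 0.5 else 2 * r + 2

def brute_force(a_p, z_abs, l_max=200):
    """Exhaustive search over even non-negative integers up to l_max."""
    return min(range(0, l_max + 1, 2), key=lambda l: F(l, a_p, z_abs))

# Compare the closed-form choice with exhaustive search on a few cases,
# including a tie (f = 0.5) and the degenerate case |z_p| = 0.
for a_p, z_abs in [(1.0, 3.3), (2.0, 0.5), (0.7, 10.0), (1.5, 4.5), (3.0, 0.0)]:
    best = l_opt(a_p, z_abs)
    print(a_p, z_abs, best, F(best, a_p, z_abs), brute_force(a_p, z_abs))
```

Cost values are compared instead of minimizers because at $f = 0.5$ both $2r$ and $2r + 2$ attain the same minimum.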